1/jecolon/ziglyph v0.28

Unicode text processing for the Zig programming language.


ziglyph

In-Depth Articles on Unicode Processing with Zig and Ziglyph

The Unicode Processing with Zig series of articles on ZigNEWS covers important aspects of Unicode in general and shows how to use this library to process Unicode text.

Looking for a UTF-8 String Type?

Zigstr is a UTF-8 string type that incorporates many of Ziglyph's Unicode processing tools. You can learn more in the Zigstr repo.

Status

This is pre-1.0 software. Although breaking changes become less frequent with each minor version release, they will still occur until we reach 1.0.

Integrating Ziglyph in your Project

Using Zigmod

$ zigmod aq add 1/jecolon/ziglyph
$ zigmod fetch

Now, in your build.zig, add this import:

const deps = @import("deps.zig");

In the exe section for the executable where you wish to have Ziglyph available, add:

deps.addAllTo(exe);

Manually via Git

In a libs subdirectory under the root of your project, clone this repository via

$ git clone https://github.com/jecolon/ziglyph.git

Now in your build.zig, you can add:

exe.addPackagePath("ziglyph", "libs/ziglyph/src/ziglyph.zig");

to the exe section for the executable where you wish to have Ziglyph available. Now in the code, you can import components like this:

const ziglyph = @import("ziglyph");
const letter = @import("ziglyph").letter; // or const letter = ziglyph.letter;
const number = @import("ziglyph").number; // or const number = ziglyph.number;

Using the ziglyph Namespace

The ziglyph namespace provides convenient access to the most frequently used functions related to Unicode code points and strings.

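const std = @import("std");
const expect = std.testing.expect;
const expectEqual = std.testing.expectEqual;
const ziglyph = @import("ziglyph");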

test "ziglyph namespace" {
    const z = 'z';
    try expect(ziglyph.isLetter(z));
    try expect(ziglyph.isAlphaNum(z));
    try expect(ziglyph.isPrint(z));
    try expect(!ziglyph.isUpper(z));
    const uz = ziglyph.toUpper(z);
    try expect(ziglyph.isUpper(uz));
    try expectEqual(uz, 'Z');

    // String toLower, toTitle, and toUpper. Each result is owned by the caller,
    // so every block frees its own allocation exactly once, even if a check fails.
    const allocator = std.testing.allocator;
    {
        const got = try ziglyph.toLowerStr(allocator, "AbC123");
        defer allocator.free(got);
        try expect(std.mem.eql(u8, "abc123", got));
    }
    {
        const got = try ziglyph.toUpperStr(allocator, "aBc123");
        defer allocator.free(got);
        try expect(std.mem.eql(u8, "ABC123", got));
    }
    {
        const got = try ziglyph.toTitleStr(allocator, "thE aBc123 moVie. yes!");
        defer allocator.free(got);
        try expect(std.mem.eql(u8, "The Abc123 Movie. Yes!", got));
    }
}

Category Namespaces

Namespaces for frequently-used Unicode General Categories are available. See ziglyph.zig for a full list of all components.

const std = @import("std");
const expect = std.testing.expect;
const expectEqual = std.testing.expectEqual;
const letter = @import("ziglyph").letter;
const punct = @import("ziglyph").punct;

test "Category namespaces" {
    const z = 'z';
    try expect(letter.isLetter(z));
    try expect(!letter.isUpper(z));
    try expect(!punct.isPunct(z));
    try expect(punct.isPunct('!'));
    const uz = letter.toUpper(z);
    try expect(letter.isUpper(uz));
    try expectEqual(uz, 'Z');
}

Normalization

In addition to the basic functions to detect and convert code point case, the Normalizer struct provides code point and string normalization methods. All normalization forms are supported (NFC, NFKC, NFD, NFKD).

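const std = @import("std");
const expect = std.testing.expect;
const expectEqualSlices = std.testing.expectEqualSlices;
const Normalizer = @import("ziglyph").Normalizer;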

test "normalizeTo" {
    var allocator = std.testing.allocator;
    var normalizer = try Normalizer.init(allocator);
    defer normalizer.deinit();

    // Canonical Composition (NFC)
    const input_nfc = "Complex char: \u{03D2}\u{0301}";
    const want_nfc = "Complex char: \u{03D3}";
    const got_nfc = try normalizer.normalizeTo(.composed, input_nfc);
    try expectEqualSlices(u8, want_nfc, got_nfc);

    // Compatibility Composition (NFKC)
    const input_nfkc = "Complex char: \u{03A5}\u{0301}";
    const want_nfkc = "Complex char: \u{038E}";
    const got_nfkc = try normalizer.normalizeTo(.komposed, input_nfkc);
    try expectEqualSlices(u8, want_nfkc, got_nfkc);

    // Canonical Decomposition (NFD)
    const input_nfd = "Complex char: \u{03D3}";
    const want_nfd = "Complex char: \u{03D2}\u{0301}";
    const got_nfd = try normalizer.normalizeTo(.canon, input_nfd);
    try expectEqualSlices(u8, want_nfd, got_nfd);

    // Compatibility Decomposition (NFKD)
    const input_nfkd = "Complex char: \u{03D3}";
    const want_nfkd = "Complex char: \u{03A5}\u{0301}";
    const got_nfkd = try normalizer.normalizeTo(.compat, input_nfkd);
    try expectEqualSlices(u8, want_nfkd, got_nfkd);

    // String comparisons.
    try expect(try normalizer.eqlBy("foé", "foe\u{0301}", .normalize));
    try expect(try normalizer.eqlBy("foϓ", "fo\u{03D2}\u{0301}", .normalize));
    try expect(try normalizer.eqlBy("Foϓ", "fo\u{03D2}\u{0301}", .norm_ignore));
    try expect(try normalizer.eqlBy("FOÉ", "foe\u{0301}", .norm_ignore)); // foÉ == foé
    try expect(try normalizer.eqlBy("Foé", "foé", .ident)); // Unicode Identifiers caseless match.
}
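
To see why a normalizing comparison is needed at all, here is a minimal sketch that uses only the Zig standard library (no Ziglyph calls). It shows that the precomposed and decomposed spellings of "foé" are different byte sequences, so a plain byte comparison treats them as unequal. The test name is just illustrative.

```zig
const std = @import("std");

test "byte equality does not see canonical equivalence" {
    // U+00E9 is the precomposed é; U+0065 U+0301 is e followed by a combining acute accent.
    const precomposed = "fo\u{00E9}";
    const decomposed = "foe\u{0301}";

    // Both render as "foé", but their UTF-8 encodings differ in length and content.
    try std.testing.expectEqual(@as(usize, 4), precomposed.len);
    try std.testing.expectEqual(@as(usize, 5), decomposed.len);
    try std.testing.expect(!std.mem.eql(u8, precomposed, decomposed));
}
```

This is the gap that eqlBy with .normalize closes in the test above: it compares the strings after normalization instead of comparing raw bytes.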

Collation (String Ordering)

Sorting and ordering comparisons are among the most common operations in string processing. The Unicode Collation Algorithm was developed to address this area. The Collator struct implements the algorithm, allowing for proper sorting and order comparison of Unicode strings. Aside from the usual init function, there's initWithReader, which you can use to initialize the struct with an alternate weights table file (allkeys.bin), be it a file, a network stream, or anything else that exposes a std.io.Reader. This allows for tailoring of the sorting algorithm.

const std = @import("std");
const testing = std.testing;
const Collator = @import("ziglyph").Collator;

test "Collation" {
    const allocator = std.testing.allocator;
    var collator = try Collator.init(allocator);
    defer collator.deinit();

    // Collation weight levels overview:
    // * .primary: different letters.
    // * .secondary: same letters, but marks (like accents) differ.
    // * .tertiary: same letters and marks, but case differs.
    // So cab < dab at .primary, and cab < cáb at .secondary, and cáb < Cáb at .tertiary level.
    try testing.expect(collator.tertiaryAsc("abc", "def"));
    try testing.expect(collator.tertiaryDesc("def", "abc"));

    // At the primary level only, José and jose are equal because the base letters are the same;
    // only marks and case differ, which are .secondary and .tertiary respectively.
    try testing.expect(try collator.orderFn("José", "jose", .primary, .eq));

    // Full Unicode sort.
    var strings: [3][]const u8 = .{ "xyz", "def", "abc" };
    collator.sortAsc(&strings);
    try testing.expectEqualStrings("abc", strings[0]);
    try testing.expectEqualStrings("def", strings[1]);
    try testing.expectEqualStrings("xyz", strings[2]);

    // ASCII only binary sort. If you know the strings are ASCII only, this is much faster.
    strings = .{ "xyz", "def", "abc" };
    collator.sortAsciiAsc(&strings);
    try testing.expectEqualStrings("abc", strings[0]);
    try testing.expectEqualStrings("def", strings[1]);
    try testing.expectEqualStrings("xyz", strings[2]);
}
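
As a point of comparison, the following sketch uses only the Zig standard library (no Ziglyph calls) to show what a plain byte-wise ordering does with mixed case and accented letters. It is meant to illustrate why a collation algorithm is useful, not how the Collator works internally; the sample strings are arbitrary.

```zig
const std = @import("std");

test "byte order is not alphabetical order" {
    // 'Z' is 0x5A and 'a' is 0x61, so every uppercase ASCII letter sorts
    // before every lowercase one in a byte-wise comparison.
    try std.testing.expect(std.mem.lessThan(u8, "Zebra", "apple"));

    // 'z' is the single byte 0x7A, while é (U+00E9) encodes as 0xC3 0xA9,
    // so byte order places "é" after "z".
    try std.testing.expect(std.mem.lessThan(u8, "z", "\u{00E9}"));
}
```

A byte-wise sort like sortAsciiAsc is fast precisely because it skips the weight lookups, which is why it is only appropriate when the input is known to be ASCII.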

Tailoring With allkeys.txt

To tailor the sorting algorithm, you can create a modified allkeys.txt and generate a new compressed binary allkeys.bin file from it. Follow these steps:

# Change to the Ziglyph source directory.
cd <path to ziglyph>/src/
# Build the UDDC tool for your platform.
zig build-exe -O ReleaseSafe uddc.zig
# Create a new directory to store the UDDC tool and modified data files.
mkdir <path to new data dir>
# Move the tool and copy the data file to the new directory.
mv uddc <path to new data dir>/
cp data/uca/allkeys.txt <path to new data dir>/
# Change into the new data dir.
cd <path to new data dir>/
# Modify the allkeys.txt file with your favorite editor.
vim allkeys.txt
# Generate the new compressed binary allkeys.bin
./uddc allkeys.txt

After running these commands, you can then use this new allkeys.bin file with the initWithReader method:

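const std = @import("std");
const Collator = @import("ziglyph").Collator;

// `allocator` is whichever std.mem.Allocator your program already uses.
var file = try std.fs.cwd().openFile("<path to new data dir>/allkeys.bin", .{});
defer file.close();
// Keep the buffered reader in a variable; reader() needs a pointer to it.
var buf_reader = std.io.bufferedReader(file.reader());
const reader = buf_reader.reader();
var collator = try Collator.initWithReader(allocator, reader);
defer collator.deinit();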

// ...use the collator as usual.

Text Segmentation (Grapheme Clusters, Words, Sentences)

Ziglyph has iterators to traverse text as Grapheme Clusters (what most people recognize as characters), Words, and Sentences. All of these text segmentation functions adhere to the Unicode Text Segmentation rules, which may surprise you in terms of what's included and excluded at each break point. Test before assuming any results! There are also non-allocating compile-time versions for use with string literals or embedded files. Note that for compile-time versions, you may need to increase the compile-time branch evaluation quota via @setEvalBranchQuota.

const std = @import("std");
const testing = std.testing;
const Grapheme = @import("ziglyph").Grapheme;
const GraphemeIterator = Grapheme.GraphemeIterator;
const Sentence = @import("ziglyph").Sentence;
const SentenceIterator = Sentence.SentenceIterator;
const ComptimeSentenceIterator = Sentence.ComptimeSentenceIterator;
const Word = @import("ziglyph").Word;
const WordIterator = Word.WordIterator;

test "GraphemeIterator" {
    const input = "H\u{0065}\u{0301}llo";
    var iter = try GraphemeIterator.init(input);

    const want = &[_][]const u8{ "H", "\u{0065}\u{0301}", "l", "l", "o" };

    var i: usize = 0;
    while (iter.next()) |grapheme| : (i += 1) {
        try testing.expect(grapheme.eql(want[i]));
    }

    // Need your grapheme clusters at compile time?
    comptime {
        var ct_iter = try GraphemeIterator.init(input);
        var j = 0;
        while (ct_iter.next()) |grapheme| : (j += 1) {
            try testing.expect(grapheme.eql(want[j]));
        }
    }
}

test "SentenceIterator" {
    var allocator = std.testing.allocator;
    const input =
        \\("Go.") ("He said.")
    ;
    var iter = try SentenceIterator.init(allocator, input);
    defer iter.deinit();

    // Note the space after the closing right parenthesis is included as part
    // of the first sentence.
    const s1 =
        \\("Go.") 
    ;
    const s2 =
        \\("He said.")
    ;
    const want = &[_][]const u8{ s1, s2 };

    var i: usize = 0;
    while (iter.next()) |sentence| : (i += 1) {
        try testing.expectEqualStrings(sentence.bytes, want[i]);
    }

    // Need your sentences at compile time?
    @setEvalBranchQuota(2_000);

    comptime var ct_iter = ComptimeSentenceIterator(input){};
    const n = comptime ct_iter.count();
    var sentences: [n]Sentence = undefined;
    comptime {
        var ct_i: usize = 0;
        while (ct_iter.next()) |sentence| : (ct_i += 1) {
            sentences[ct_i] = sentence;
        }
    }

    for (sentences) |sentence, j| {
        try testing.expect(sentence.eql(want[j]));
    }
}

test "WordIterator" {
    const input = "The (quick) fox. Fast! ";
    var iter = try WordIterator.init(input);

    const want = &[_][]const u8{ "The", " ", "(", "quick", ")", " ", "fox", ".", " ", "Fast", "!", " " };

    var i: usize = 0;
    while (iter.next()) |word| : (i += 1) {
        try testing.expectEqualStrings(word.bytes, want[i]);
    }

    // Need your words at compile time?
    @setEvalBranchQuota(2_000);

    comptime {
        var ct_iter = try WordIterator.init(input);
        var j = 0;
        while (ct_iter.next()) |word| : (j += 1) {
            try testing.expect(word.eql(want[j]));
        }
    }
}
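
The difference between bytes, code points, and grapheme clusters can be checked with the standard library alone. The sketch below (no Ziglyph calls; the test name is illustrative) reuses the input from the GraphemeIterator example and counts its bytes and code points, which is as far as std goes without a segmentation library.

```zig
const std = @import("std");

test "bytes, code points, and grapheme clusters differ" {
    // Same input as the GraphemeIterator example: "Héllo" with a decomposed é.
    const input = "H\u{0065}\u{0301}llo";

    // 7 bytes: U+0301 takes two bytes in UTF-8, everything else is ASCII.
    try std.testing.expectEqual(@as(usize, 7), input.len);

    // 6 code points: H, e, U+0301, l, l, o.
    try std.testing.expectEqual(@as(usize, 6), try std.unicode.utf8CountCodepoints(input));

    // The GraphemeIterator test expects 5 clusters, because e + U+0301
    // together form a single user-perceived character.
}
```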

Code Point and String Display Width

When working with environments in which text is rendered in a fixed-width font, such as terminal emulators, it's necessary to know how many cells (or columns) a particular code point or string will occupy. The display_width namespace provides functions to do just that.

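const std = @import("std");
const expectEqual = std.testing.expectEqual;
const expectEqualSlices = std.testing.expectEqualSlices;
const dw = @import("ziglyph").display_width;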

test "Code point / string widths" {
    // The width methods take a second parameter of value .half or .full to determine the width of 
    // ambiguous code points as per the Unicode standard. .half is the most common case.

    // Note that codePointWidth returns an i3 because code points like backspace have width -1.
    try expectEqual(dw.codePointWidth('é', .half), 1);
    try expectEqual(dw.codePointWidth('😊', .half), 2);
    try expectEqual(dw.codePointWidth('统', .half), 2);

    var allocator = std.testing.allocator;

    // strWidth returns usize because it can never be negative, regardless of the code points it contains.
    try expectEqual(try dw.strWidth("Hello\r\n", .half), 5);
    try expectEqual(try dw.strWidth("\u{1F476}\u{1F3FF}\u{0308}\u{200D}\u{1F476}\u{1F3FF}", .half), 2);
    try expectEqual(try dw.strWidth("Héllo 🇵🇷", .half), 8);
    try expectEqual(try dw.strWidth("\u{26A1}\u{FE0E}", .half), 1); // Text sequence
    try expectEqual(try dw.strWidth("\u{26A1}\u{FE0F}", .half), 2); // Presentation sequence

    // padLeft, center, padRight
    const right_aligned = try dw.padLeft(allocator, "w😊w", 10, "-");
    defer allocator.free(right_aligned);
    try expectEqualSlices(u8, "------w😊w", right_aligned);

    const centered = try dw.center(allocator, "w😊w", 10, "-");
    defer allocator.free(centered);
    try expectEqualSlices(u8, "---w😊w---", centered);

    const left_aligned = try dw.padRight(allocator, "w😊w", 10, "-");
    defer allocator.free(left_aligned);
    try expectEqualSlices(u8, "w😊w------", left_aligned);
}
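
To make the padding arithmetic in the test above concrete, here is a small standard-library-only sketch (no Ziglyph calls; the test name is illustrative). It shows that neither the byte length nor the code point count of "w😊w" matches the four columns the string occupies.

```zig
const std = @import("std");

test "neither bytes nor code points give the column count" {
    const s = "w\u{1F60A}w"; // w😊w

    // The emoji alone is 4 bytes in UTF-8, so the string is 6 bytes long.
    try std.testing.expectEqual(@as(usize, 6), s.len);

    // It is 3 code points: w, U+1F60A, w.
    try std.testing.expectEqual(@as(usize, 3), try std.unicode.utf8CountCodepoints(s));

    // On screen the emoji is two cells wide, so the string occupies
    // 1 + 2 + 1 = 4 columns. Padding to 10 columns therefore needs
    // 10 - 4 = 6 fill characters, matching the padLeft example.
}
```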

Word Wrap

If you need to wrap a string to a specific number of columns according to Unicode word boundaries and display width, you can use the wrap function in the display_width namespace. You can also specify a threshold value indicating how close a word boundary can be to the column limit and still trigger a line break.

const std = @import("std");
const testing = std.testing;
const dw = @import("ziglyph").display_width;

test "display_width wrap" {
    var allocator = testing.allocator;
    var input = "The quick brown fox\r\njumped over the lazy dog!";
    var got = try dw.wrap(allocator, input, 10, 3);
    defer allocator.free(got);
    var want = "The quick\n brown \nfox jumped\n over the\n lazy dog\n!";
    try testing.expectEqualStrings(want, got);
}

Package Contents

  • .gitattributes
  • LICENSE
  • build.zig
  • src/ascii.zig
  • src/data/ucd/GraphemeBreakTest.txt
  • src/data/ucd/UnicodeData.txt
  • src/data/ucd/Decompositions.bin
  • src/data/ucd/NormalizationTest.txt
  • src/data/ucd/WordBreakTest.txt
  • src/data/ucd/SentenceBreakTest.txt
  • src/data/uca/CollationTest_NON_IGNORABLE_SHORT.txt
  • src/data/uca/allkeys.txt
  • src/data/uca/allkeys.bin
  • src/data/license/UnicodeLicenseAgreement.html
  • src/data/license/standard_styles.css
  • src/normalizer/DecompFile.zig
  • src/normalizer/Normalizer.zig
  • src/normalizer/Trieton.zig
  • src/segmenter/CodePoint.zig
  • src/segmenter/Sentence.zig
  • src/segmenter/Word.zig
  • src/segmenter/Grapheme.zig
  • src/ziglyph.zig
  • src/display_width.zig
  • src/tests.zig
  • src/uddc.zig
  • src/tests/readme_tests.zig
  • src/autogen/word_break_property.zig
  • src/autogen/derived_general_category.zig
  • src/autogen/title_map.zig
  • src/autogen/derived_numeric_type.zig
  • src/autogen/emoji_data.zig
  • src/autogen/blocks.zig
  • src/autogen/canonicals.zig
  • src/autogen/derived_east_asian_width.zig
  • src/autogen/prop_list.zig
  • src/autogen/derived_combining_class.zig
  • src/autogen/case_folding.zig
  • src/autogen/derived_core_properties.zig
  • src/autogen/derived_normalization_props.zig
  • src/autogen/grapheme_break_property.zig
  • src/autogen/hangul_syllable_type.zig
  • src/autogen/sentence_break_property.zig
  • src/autogen/upper_map.zig
  • src/autogen/lower_map.zig
  • src/category/letter.zig
  • src/category/mark.zig
  • src/category/punct.zig
  • src/category/symbol.zig
  • src/category/number.zig
  • src/collator/Collator.zig
  • src/collator/CollatorTrie.zig
  • src/collator/AllKeysFile.zig
  • README.md
  • zig.mod
  • .gitignore

History

Version  Published On  Tree @ Commit  Size
v0.33 Fri, 02 Dec 2022 14:57:26 UTC Tree 7.232 MB
v0.32 Fri, 25 Nov 2022 01:25:14 UTC Tree 10.144 MB
v0.31 Fri, 25 Nov 2022 01:18:53 UTC Tree 10.144 MB
v0.30 Sat, 19 Nov 2022 14:29:40 UTC Tree 10.202 MB
v0.29 Sat, 19 Nov 2022 02:09:25 UTC Tree 10.205 MB
v0.28 Sat, 19 Nov 2022 01:37:54 UTC Tree 10.204 MB
v0.27 Thu, 03 Nov 2022 11:57:54 UTC Tree 10.195 MB
v0.26 Tue, 06 Sep 2022 15:57:44 UTC Tree 10.434 MB
v0.25 Sun, 26 Dec 2021 13:25:45 UTC Tree 10.434 MB
v0.24 Sun, 26 Dec 2021 13:07:14 UTC Tree 10.434 MB
v0.23 Sun, 26 Dec 2021 12:50:27 UTC Tree 10.434 MB
v0.22 Fri, 24 Dec 2021 13:45:05 UTC Tree 10.457 MB
v0.21 Thu, 23 Dec 2021 16:03:13 UTC Tree 10.457 MB
v0.20 Thu, 23 Dec 2021 15:58:47 UTC Tree 10.457 MB
v0.19 Sat, 25 Sep 2021 00:29:44 UTC Tree 10.457 MB
v0.18 Wed, 22 Sep 2021 00:13:07 UTC Tree 10.457 MB
v0.17 Mon, 20 Sep 2021 10:50:08 UTC Tree 10.459 MB
v0.16 Fri, 17 Sep 2021 00:15:40 UTC Tree 10.457 MB
v0.15 Thu, 16 Sep 2021 01:30:12 UTC Tree 10.455 MB
v0.14 Tue, 14 Sep 2021 22:07:09 UTC Tree 10.251 MB
v0.13 Tue, 14 Sep 2021 01:29:25 UTC Tree 10.251 MB
v0.12 Fri, 27 Aug 2021 14:12:57 UTC Tree 10.251 MB
v0.11 Fri, 27 Aug 2021 14:10:18 UTC Tree 10.251 MB
v0.10 Fri, 27 Aug 2021 10:46:32 UTC Tree 10.263 MB
v0.9 Fri, 27 Aug 2021 02:20:32 UTC Tree 10.263 MB
v0.8 Fri, 27 Aug 2021 01:42:44 UTC Tree 10.263 MB
v0.7 Thu, 26 Aug 2021 16:01:47 UTC Tree 10.260 MB
v0.6 Thu, 26 Aug 2021 15:58:25 UTC Tree 10.260 MB
v0.5 Thu, 26 Aug 2021 15:57:17 UTC Tree 10.260 MB
v0.4 Thu, 26 Aug 2021 15:51:06 UTC Tree 10.260 MB
v0.3 Mon, 23 Aug 2021 10:55:09 UTC Tree 10.228 MB
v0.2 Mon, 23 Aug 2021 00:41:38 UTC Tree 10.228 MB
v0.1 Sun, 22 Aug 2021 23:46:25 UTC Tree 10.228 MB