kinogaki-codecs

kinogaki-codecs — audit issues

Audited 2026-06-20. Repo: libraries/kinogaki-codecs. Criteria: completeness · tests · separation of concerns · verb-named functions · file size (<1000 lines) · organization.

Health summary

A genuinely solid, well-factored codec library: six codecs (json, markdown, html, svg, text, blob) plus a bundle convention, all sharing a single document model, with a dependency-free JSON/XML/HTML parser stack and 28 tests / 234 checks that build clean (-Wall -Wextra) and pass. The most material problems are documentation drift (the README still presents five of the six codecs as future work) and a handful of honest-but-untested gaps: markdown nested lists are claimed in the header but silently unsupported, and the document/vector decoders never report errors (the ParseError* is ignored and decode always "succeeds"). No file approaches the size limit, there are no committed build artifacts, and separation of concerns is clean apart from two small copy-pasted helpers.

Issues

[HIGH] completeness — markdown nested lists are claimed but silently unsupported

Where: src/MarkdownCodec.cpp:220, src/MarkdownCodec.cpp:232; header claim include/kinogaki/codecs/MarkdownCodec.h:3
Problem: The header advertises "ordered/unordered lists incl. nesting", but list parsing is gated on indentOf(L[i]) == 0. An indented (nested) list marker never opens a sublist — its lines are absorbed into the parent item as continuation text, so - a\n - b\n loses the nested list entirely with no error. The claimed feature is not implemented, and no test exercises nesting.
Fix: Either implement indented sublist recognition (recurse parseBlocks on the de-indented item body — the item content already strips contentCol, so a marker at the item's indent should start a nested list), or remove "incl. nesting" from the header and add an explicit "nested lists are flattened" note plus a test pinning the actual behaviour.

[MEDIUM] completeness — document/vector decoders ignore ParseError and never fail

Where: src/MarkdownCodec.cpp:335, src/HtmlCodec.cpp:398, src/SvgCodec.cpp:194
Problem: All three decode() signatures take a ParseError* (per the Codec seam) but drop it on the floor — the parameter is unnamed and unused. They always return a Document, so malformed/truncated input (an unterminated <svg, a fence that never closes, garbage tags) yields a degraded document with no signal to the caller. Only JsonCodec::decode populates ParseError. This is the "error paths that silently swallow failures" smell from criterion 1.
Fix: Decide and document the contract. If these parsers are intentionally total (HTML/MD are lenient by design), say so in each header and consider not accepting a ParseError* you can't fill; otherwise detect at least gross failures (unterminated raw/code/quote spans, no root element for SVG) and report them.

[MEDIUM] organization — README documents only `json`; the other five codecs are presented as future work

Where: README.md:9-28 (esp. README.md:28)
Problem: The README's "Codecs" section describes only the JSON codec, then closes with "(More codecs — text, blob, csv, … — will join here…)". In fact markdown, html, svg, text, blob codecs and the whole bundle/filesystem convention are implemented, tested, and shipping. A reader (or agent) trusting the README would not know they exist. CANONICAL.md is accurate and current, which makes the README the stale artifact.
Fix: Rewrite the Codecs section to list all shipped codecs (json, markdown, html, svg, text, blob) and the bundle convention, each one line, pointing at CANONICAL.md for the schema contract and the per-format round-trip/loss rules.

[MEDIUM] tests — no malformed-input tests for markdown/html/svg; lossy round-trips asserted but not pinned

Where: tests/markdown_codec_test.cpp, tests/html_codec_test.cpp, tests/svg_codec_test.cpp
Problem: Only JSON has negative/garbage tests (tests/json_codec_test.cpp:89, :105). The document/vector codecs have no malformed-input cases (unterminated tags, unclosed fences/blockquotes, stray </close>, deeply nested input, a non-/svg root for SVG encode is tested but decode robustness is not). CANONICAL.md promises a tabulated cross-format degradation table where "each row is a test", but the md encode of doc-v1-only constructs (definitionList → bold term, figure → italic caption, note → blockquote, rawHtml verbatim — src/MarkdownCodec.cpp:311-328) has no test asserting the degraded output. UTF-8 multibyte / entity edge cases beyond &/< are untested in html.
Fix: Add malformed-input cases per codec (assert they don't crash and produce a defined result), and add the html→md degradation tests CANONICAL.md commits to (one per doc-v1 construct).

[LOW] separation — `safeSegment` duplicated across two codecs

Where: src/JsonCodec.cpp:35 and src/Bundle.cpp:50
Problem: Two near-identical "filename/key → safe path segment" routines. JsonCodec's fails closed on an unsafe key (correct for untrusted JSON); Bundle's sanitises and disambiguates. The shared concept (what makes a valid [A-Za-z0-9_] segment) is reimplemented, so a future change to the path charset must be made in two places.
Fix: Factor the "is this a representable segment" predicate into one shared helper (a small internal/ header), keeping the two distinct policies (reject vs. sanitise) as thin wrappers over it.

[LOW] separation — entity/escape helpers duplicated between HtmlCodec and SvgCodec

Where: src/HtmlCodec.cpp:18-64 and src/SvgCodec.cpp:24-60
Problem: appendUtf8, decodeEntities, escapeText, escapeAttr are copy-pasted verbatim between the two markup codecs. SvgCodec.cpp:23 acknowledges this ("kept local so the codec is self-contained"), so it is a conscious tradeoff, but it is still duplicated XML-escaping logic that can drift.
Fix: Optional — extract a shared XmlText helper header. If self-containment is preferred, leave a cross-reference comment in both files so a fix to one is mirrored.

[LOW] completeness — JSON `\u` surrogate pairs not combined (documented, untested)

Where: src/Json.cpp:196
Problem: The string parser decodes each \uXXXX independently and never combines a high/low surrogate pair into the astral code point; a comment admits "surrogate pairs aren't combined here; lone units encode as-is". Emoji and other supplementary-plane characters written as surrogate pairs decode to invalid UTF-8. No test covers \u escapes at all (the escape test at tests/json_codec_test.cpp:66 uses only \n\t\"\\).
Fix: Combine surrogate pairs (0xD800–0xDBFF followed by 0xDC00–0xDFFF) into the 4-byte UTF-8 form, and add a 😀-style round-trip test. If out of scope for v1, keep the comment but add a test pinning the current behaviour so it's a known, not a surprise.

[LOW] naming — function names are clean

Where: src/ (all files)
Problem: None material. Free functions read as verbs (escapeTo, numberTo, dumpTo, parseInline, parseBlocks, renderBlocks, buildChildren, writeElement, materializeFile, addFile, addFolder, graft). A few helpers are predicate-/noun-shaped — attr, lower, root, standalone, safeSegment, docCodec, svgCodec, entries, fileName, leavesOf, childrenOf — but these are accessors/queries/factories where the noun form reads naturally and matches the surrounding codebase style; bool predicates are appropriately is*/blank/contains. Not worth churning.
Fix: None required.

[LOW] filesize — clean

Where: src/
Problem: Largest source file is src/HtmlCodec.cpp at 413 lines; next is src/MarkdownCodec.cpp at 348. Nothing exceeds 800, let alone 1000.
Fix: None required.

[LOW] organization — no committed build artifacts; layout is clean

Where: repo root
Problem: None. build/ and .DS_Store are git-ignored (.gitignore), no .o/.a/_site is tracked, headers mirror sources, tests mirror codecs, both build.sh (clang loop) and CMake are present and consistent. The only orphan-adjacent note is the stale README (covered above).
Fix: None required.

‹ Core Platform ›