Documentation

DataSchemer (DS) is a lightweight, schema-driven library for parsing, validating, and typing structured user input into clean, well-defined Python data structures.

DS is designed to sit at the boundary between user-facing interfaces (command-line tools, configuration files, simple APIs) and internal application logic. Schemas describe what data is expected and how it should be interpreted; DS handles type coercion, validation, defaults, and structural mapping in a consistent, reusable way. DS fits naturally in a simple script that needs user input as well as in a substantial software suite with a deep command-line interface.

Key ideas include:

  • Declarative schemas expressed as plain Python dictionaries, with support for reuse and extension via inheritance.
  • Strict, predictable parsing, with optional support for simple mathematical expressions (basic arithmetic), evaluated explicitly and safely.
  • First-class CLI integration, including help text, validation, and tab completion.
  • Minimal assumptions, making the core a useful utility independent of CLI applications.

DataSchemer serves as the backbone for user input in the Principia Materia software suite, but DS is a standalone utility that should be useful in other applications.

Citation

If DataSchemer is used in published academic work, please cite it. A recommended citation is provided in CITATION.cff (see repository).

1 - Getting Started

DataSchemer lets you define the shape of your input once, and then rely on it to parse, validate, and structure user input consistently across your application.

At its core, DataSchemer uses schemas to describe what input is expected and how it should be interpreted. From that description, it handles type coercion, validation, defaults, and structured results for you.

A first example

Suppose your program expects a set of lattice vectors and an optional scale factor. You would like to accept human-friendly input, but work internally with numeric data.

A simple example

from data_schemer.schema_projector import SchemaProjector

schema = {
  "variables": {
    "lattice_vectors": {
      "type": "float-matrix",
      "help": "3×3 lattice vectors",
    },
    "scale": {
      "type": "float",
      "optional": True,
      "default": 1.0,
      "help": "Scale factor applied to the lattice vectors",
    },
  }
}

input_data = {
  "lattice_vectors": """
    0 1 1
    1 0 1
    1 1 0
  """,
  "scale": "4.8",
}

result = SchemaProjector(schema, input_data)

print(result.data)

The output is

{
  'lattice_vectors': array([[0., 1., 1.],
                            [1., 0., 1.],
                            [1., 1., 0.]]),
  'scale': 4.8
}

Basic substitution and arithmetic are safely supported. A hexagonal lattice may be entered as

input_data_tri = {
  "lattice_vectors": """\
    a=2.23  c=5.8
    a*r3/2  a/2  0
    a*r3/2 -a/2  0
    0       0    c
  """,
  "scale": "1.0",
}

result = SchemaProjector(schema, input_data_tri)

print(result.data["lattice_vectors"])

yielding

[[ 1.93123665  1.115       0.        ]
 [ 1.93123665 -1.115       0.        ]
 [ 0.          0.          5.8       ]]

A command line interface can automatically be constructed from the schema as follows

# test.py
from data_schemer.schema_command_line import SchemaCommandLine 
scl = SchemaCommandLine(schema)

print(scl.data['lattice_vectors'])

Executing on the command line gives

$ python test.py --lattice-vectors '0 1/2 1/2 ; 1/2 0 1/2 ; 1/2 1/2 0'
[[0.  0.5 0.5]
 [0.5 0.  0.5]
 [0.5 0.5 0. ]]

where the text string is parsed using a hierarchy of delimiters. Help menus are automatically generated from the schema information.

$ python test.py -h

usage: test.py [-h] [--lattice-vectors X] [--scale X]

options:
  -h, --help
                        show this help message and exit
  --lattice-vectors X
                        (required) 3×3 lattice vectors
  --scale X    Scale factor applied to the lattice vectors

1.1 - Install

DataSchemer can be installed using pip. Very soon, you will be able to install from PyPI using

pip install data-schemer

Until then, clone the repo https://github.com/marianettigroup/data-schemer . Then install using pip, preferably in editable mode.

pip install -e .

2 - Guides

Practical, copy-paste usage examples for DataSchemer.

2.1 - Coercing numbers

DataSchemer is designed to accept human-friendly numeric input and reliably convert it into structured Python data. This is especially important for command-line tools, configuration files, and lightweight data formats, where users naturally write numbers as text, expressions, or simple tables.

The numeric coercion utilities live in coerce_numbers.py and can be used directly, or indirectly through schema-driven workflows.


The core idea

The coercion utilities take loosely structured numeric text and turn it into:

  • Python scalars (int, float, Fraction, …)
  • Python lists of numbers
  • Nested lists representing matrices

The input may contain:

  • arithmetic expressions
  • symbolic substitutions
  • multiple values separated by whitespace or punctuation
  • simple matrix layouts

The goal is to let users write what they mean, without forcing rigid syntax or file formats.


Import

The most commonly used entry point is:

from data_schemer.coerce_numbers import coerce_number_list_with_substitutions

This function always returns Python lists, never NumPy arrays.
Results can be trivially converted to NumPy arrays, or one may import coerce_array_with_substitutions, which returns a NumPy array.
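For example, a small sketch of converting the result, using only the call documented above:

import numpy as np
from data_schemer.coerce_numbers import coerce_number_list_with_substitutions

rows = coerce_number_list_with_substitutions("1 0 ; 0 1", dtype=float)
matrix = np.array(rows)   # nested lists become a (2, 2) NumPy array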


Scalars: numbers as expressions

At the simplest level, a single number—written as text—is parsed and evaluated:

coerce_number_list_with_substitutions("3", dtype=int)
# -> [3]

Expressions are allowed, so users don’t need to precompute values:

coerce_number_list_with_substitutions("1/2", dtype=float)
# -> [0.5]

coerce_number_list_with_substitutions("3*4 + 1", dtype=int)
# -> [13]

Supported arithmetic includes:

  • addition and subtraction: +, -
  • multiplication and division: *, /
  • scientific notation: 1e-3, 2E+4
  • square roots using rN notation, for example:

coerce_number_list_with_substitutions("r2", dtype=float)
# -> [1.4142135623730951]

This is intentionally small but predictable: enough power for scientific input, without becoming a general-purpose programming language.


Multiple values: lists

Most real inputs contain more than one number. Values can be separated by:

  • spaces
  • commas
  • semicolons
  • newlines

All of the following are equivalent:

coerce_number_list_with_substitutions("1 2 3", dtype=int)
# -> [1, 2, 3]

coerce_number_list_with_substitutions("1,2,3", dtype=int)
# -> [1, 2, 3]

coerce_number_list_with_substitutions("1; 2; 3", dtype=int)
# -> [1, 2, 3]

This makes the parser forgiving and easy to use in CLI contexts.


Matrices: nested lists

When separators imply rows (such as newlines or semicolons), the result becomes a nested list:

coerce_number_list_with_substitutions("1,0;0,1", dtype=int)
# -> [[1, 0], [0, 1]]

Newlines work naturally:

text = """
1 0 0
0 1 0
0 0 1
"""

coerce_number_list_with_substitutions(text, dtype=int)
# -> [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

At this stage, the structure is purely Python lists—no NumPy assumptions are made.


Header substitutions: symbolic values

One of the most powerful features is header substitutions.

You can define symbols at the top of the input and reuse them below:

text = """
a=0.5 b=1/3
a, b
0.0, 2*a
"""

coerce_number_list_with_substitutions(text, dtype=float)
# -> [[0.5, 0.3333333333333333], [0.0, 1.0]]

How this works:

  1. The first line defines substitutions (a, b)
  2. These symbols are available in all subsequent expressions
  3. Expressions are evaluated after substitution

This is especially useful for:

  • lattice vectors
  • parameterized matrices
  • avoiding repeated numeric constants

Exact arithmetic with Fraction

By default, numeric expressions are evaluated using floating-point arithmetic. If you need exact rational values, you can request Fraction explicitly:

from fractions import Fraction

coerce_number_list_with_substitutions("1/3", dtype=Fraction)
# -> [Fraction(1, 3)]

This applies consistently to expressions and substitutions:

text = """
a=1/3
2*a
"""

coerce_number_list_with_substitutions(text, dtype=Fraction)
# -> [Fraction(2, 3)]

This is particularly useful in symbolic or group-theoretical contexts where exact ratios matter.


What this function guarantees

  • Output is always a Python list (possibly nested)
  • All numeric values are of the requested dtype
  • Expressions are evaluated deterministically
  • Substitutions are scoped to the input block
  • No NumPy dependency is introduced at this level

When to use numeric coercion directly

Use these utilities directly when you are:

  • parsing numeric text files
  • accepting numeric expressions from users
  • handling CLI arguments that represent vectors or matrices
  • preprocessing data before schema validation

In schema-driven workflows, you typically won’t call this yourself. SchemaProjector invokes numeric coercion automatically when a schema variable is declared as numeric.


Design philosophy

The numeric coercion layer is intentionally:

  • permissive in input syntax
  • strict in output structure
  • predictable and reproducible
  • free of side effects

It acts as a bridge between human-readable numeric text and strongly-typed Python data—without imposing unnecessary ceremony.

2.2 - Rendering data

DataSchemer includes small, focused utilities for rendering Python data into readable, YAML-compatible text. These tools are intended for human-facing output: command-line tools, reports, and snippets that users may want to copy directly into configuration files.

The rendering utilities live in render_data.py.

They are not a full serialization system. Instead, they provide:

  • stable, predictable formatting
  • readable alignment for numeric data
  • minimal YAML-compatible syntax
  • no implicit I/O or file handling

Design goals

The rendering layer is intentionally conservative.

It aims to:

  • produce output that is easy to read in a terminal
  • remain valid YAML when pasted into a file
  • avoid introducing YAML features that obscure structure
  • keep formatting decisions explicit and reproducible

For full serialization, schema validation, or round-tripping, standard YAML libraries should be used instead.


Overview of the API

The public rendering API consists of three main functions:

  • render_variable
  • render_array
  • render_mapping

These functions are designed to compose cleanly: render_variable dispatches to render_array or render_mapping as needed.


render_variable

render_variable(name, value, *, indent=0)

Render a single named variable and its value as YAML-style text.

This is the primary entry point used by DataSchemer command-line tools when emitting results.

Behavior

  • Prepends the variable name followed by :
  • Delegates rendering of the value based on its type
  • Applies indentation consistently
  • Produces YAML-compatible output

Example

from data_schemer.render_data import render_variable

render_variable("temperature", 300)

Output:

temperature: 300

Arrays and mappings are rendered on subsequent lines with indentation:

import numpy as np

render_variable("vector", np.array([1, 2, 3]))
vector:
  [ 1, 2, 3 ]

render_array

render_array(array, *, indent=0)

Render a one- or two-dimensional array as aligned, readable text.

This function accepts:

  • NumPy arrays
  • nested Python lists
  • array-like objects convertible to NumPy

Formatting rules

  • Arrays are rendered using bracket notation ([ ... ])
  • Rows are aligned vertically for readability
  • Floating-point values are trimmed of insignificant zeros
  • Negative zero is suppressed (-0.0 → 0)
  • Output is valid YAML (inline sequences)

Example: vector

import numpy as np
from data_schemer.render_data import render_array

render_array(np.array([0.5, 0.0, 1.0]))

Output:

[ 0.5, 0, 1 ]

Example: matrix

matrix = np.array([
  [0.5, 0.5, 0.5],
  [0.0, 0.0, 0.0],
  [0.5, 0.0, 0.0],
])

render_array(matrix, indent=2)

Output:

  [[ 0.5, 0.5, 0.5 ],
   [ 0  , 0  , 0   ],
   [ 0.5, 0  , 0   ]]

The formatting prioritizes visual alignment over compactness.


render_mapping

render_mapping(mapping, *, indent=0)

Render a mapping (typically a dictionary) as a YAML-style block.

Behavior

  • Keys are rendered in insertion order
  • Each key is rendered using render_variable
  • Nested mappings increase indentation
  • Values may be scalars, arrays, or other mappings

Example

from data_schemer.render_data import render_mapping
import numpy as np

data = {
  "positions": np.array([
    [0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0],
  ]),
  "lattice_constant": 3.905,
}

print(render_mapping(data))

Output:

positions:
  [[ 0.5, 0.5, 0.5 ],
   [ 0  , 0  , 0   ]]
lattice_constant: 3.905

YAML compatibility

The output produced by the rendering utilities is:

  • syntactically valid YAML
  • intentionally minimal
  • free of tags, anchors, or advanced YAML constructs

This makes it suitable for:

  • copy-paste into configuration files
  • inclusion in documentation
  • inspection in terminal output

However, the rendering layer does not guarantee round-tripping back to the original Python object.


When to use render_data

Use the rendering utilities when you want:

  • human-readable numeric output
  • stable formatting for CLI tools
  • YAML-compatible snippets without full serialization
  • consistent array formatting across tools

Do not use them when you need:

  • schema enforcement
  • automatic file writing
  • full YAML feature support
  • guaranteed round-trip fidelity

Relationship to SchemaCommandLine

In schema-driven workflows, these functions are typically invoked indirectly.

SchemaCommandLine uses render_variable to emit results in a consistent, user-facing format, ensuring that command output is both readable and reusable.


Summary

The rendering layer is intentionally small:

  • render_variable handles named output
  • render_array handles numeric structure
  • render_mapping handles hierarchical data

Together, they provide a predictable bridge between Python data structures and human-readable, YAML-compatible text.

2.3 - SchemaProjector

SchemaProjector is the core validation and typing engine of DataSchemer.

It takes:

  • a schema definition (a dictionary)
  • raw input data (a dictionary, often containing strings or loosely typed values)

and produces validated, typed Python data.

Validation happens during construction: if anything is missing, unknown, or invalid, SchemaProjector(...) raises a DataSchemer user-facing exception.

This page documents the behavior that exists in the current implementation.


Basic example

from data_schemer.schema_projector import SchemaProjector

schema = {
  "variables": {
    "a": {"type": "int", "optional": False},
    "b": {"type": "float", "optional": False},
  }
}

raw = {"a": "1", "b": "2.5"}
p = SchemaProjector(schema, raw)

assert p.data["a"] == 1
assert p.data["b"] == 2.5

Validation model

Required vs optional

In DataSchemer, variables are required by default.

  • Required variables must be present in the input.
  • Optional variables are explicitly marked with optional: true.
  • A default may only be specified for optional variables. (If a required variable defines a default, schema loading raises an error.)

Defaults are applied before validation:

  • If an optional variable has a default and the user did not provide a value, the default is injected into raw_data.
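For example, a minimal sketch (the variable name is hypothetical) of a default being injected:

schema = {
  "variables": {
    "tolerance": {"type": "float", "optional": True, "default": 1e-6},
  }
}

p = SchemaProjector(schema, {})          # user provided no value
assert p.raw_data["tolerance"] == 1e-6   # default injected into raw_data
assert p.data["tolerance"] == 1e-6       # then coerced and validated as usual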

Unknown keys are rejected

SchemaProjector is strict: unknown input keys are an error.

If input_data contains a key that is not declared in the schema’s variables, construction fails with DSUnknownNameError (including close-match suggestions).

This strictness is intentional: it catches typos early and keeps schemas authoritative.

Additional schema restrictions

SchemaProjector enforces:

  • requires: a variable may require one or more other variables to be present
  • conflicts: sets of variables where at most one may be defined
  • choices: restrict allowed values (including support for string-list)
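A minimal sketch of these restrictions (variable names are hypothetical, and declaring conflicts at the schema level is an assumption based on the merged conflicts attribute described later on this page):

schema = {
  "variables": {
    "mode": {
      "type": "string",
      "optional": True,
      "choices": ["fast", "accurate"],     # restrict allowed values
    },
    "smearing": {
      "type": "string",
      "optional": True,
      "requires": ["smearing_width"],      # must be accompanied by smearing_width
    },
    "smearing_width": {"type": "float", "optional": True},
    "kpoint_mesh": {"type": "int-vector", "optional": True},
    "kpoint_density": {"type": "float", "optional": True},
  },
  # at most one of kpoint_mesh / kpoint_density may be supplied
  "conflicts": [["kpoint_mesh", "kpoint_density"]],
}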

Errors and what to catch

SchemaProjector raises user-facing DataSchemer exceptions from data_schemer.errors, including:

  • DSMissingRequiredError — required variables were not provided
  • DSUnknownNameError — schema name not found (in multi-schema form) or input contains unknown variable(s)
  • DSInvalidChoiceError — an input value is not in a variable’s choices
  • DSCoercionError — type coercion failed (includes variable name, raw value, expected type, and details)
  • DSUserError — general user-facing schema errors (e.g., conflicts, requires)

When embedding DataSchemer, it is usually sufficient to catch DSUserError (the base class), unless you want custom handling per subtype.
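For example, a minimal sketch of embedding (the exception classes are importable from data_schemer.errors, as listed above):

from data_schemer.errors import DSUserError
from data_schemer.schema_projector import SchemaProjector

schema = {"variables": {"a": {"type": "int"}}}

try:
    p = SchemaProjector(schema, {"a": "1", "b": "2"})   # "b" is not declared in the schema
except DSUserError as exc:   # base class; covers unknown names, coercion failures, etc.
    print(f"error: {exc}")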


Schema forms: single vs multi-schema

Single schema (no name)

If the schema dictionary contains a top-level variables key, it is treated as a single schema and given the default name "default" (unless you pass an explicit name).

schema = {
  "variables": {
    "x": {"type": "int", "optional": False},
  }
}

p = SchemaProjector(schema, {"x": "3"})
assert p.schema_name == "default"

Multiple schemas with inheritance

A schema dictionary may contain multiple named schemas. A derived schema may list bases in inherit.

schema = {
  "base": {
    "variables": {
      "a": {"type": "int", "optional": False},
    }
  },
  "derived": {
    "inherit": ["base"],
    "variables": {
      "mode": {
        "type": "string",
        "optional": False,
        "choices": ["fast", "accurate"]
      }
    }
  }
}

p = SchemaProjector(schema, {"a": "2", "mode": "fast"}, "derived")
assert p.data["a"] == 2
assert p.data["mode"] == "fast"

Inheritance semantics (current behavior)

  • Base schemas are applied first; child schema variables override base variables of the same name.
  • Inheritance is recursive: bases may themselves inherit from other schemas.
  • If schema_definitions contains multiple schemas, schema_name is required; omitting it raises DSUserError.
  • If schema_name does not exist, DSUnknownNameError is raised with suggested schema names.

Note: inherit is treated as an ordered list, but no explicit precedence rule is defined for the case where multiple bases define the same variable name; in practice, the recursive update order determines the winner. If you rely on multiple inheritance, avoid ambiguous overlaps between bases.


Variable definitions

A schema’s variables mapping associates each variable name with a definition dictionary. Common keys include:

  • type (required; see below)
  • optional (default: False)
  • default (optional variables only)
  • choices (list of allowed values)
  • requires (list of required companion variables)
  • code_alias (rename when constructing objects; see below)
  • help (description of the variable, used in generated help text and documentation)
  • metavar (placeholder string shown in generated usage/help text)

Variable copy / update / delete / conflict (advanced)

In multi-schema definitions, the current implementation supports schema-level variable transforms:

  • copy_variables: define a new variable by copying an existing variable from another schema using "schema@var" syntax
  • update_variables: patch selected fields of existing variable definitions
  • delete_variables: remove variables from the final merged set
  • conflicts: sets of variables where at most one may be defined

These features are powerful, but they also introduce complexity; prefer plain variable definitions unless a schema-level transform genuinely simplifies your schemas.


Supported types

Built-in types are defined in SchemaProjector._type_definitions. The most commonly used are:

Scalars:

  • int
  • float
  • fraction (fractions.Fraction)
  • bool (accepts True/False or "true"/"false", case-insensitive)
  • string
  • none (identity)

Structured:

  • string-list (string → [string], list → list of strings)
  • dict-string-to-float-matrix
  • dict-string-to-float-vector

Numeric arrays:

  • float-vector, float-matrix (support expressions + substitutions)
  • int-vector, int-matrix
  • fraction-vector, fraction-matrix
  • string-vector, string-matrix

Vector/matrix shaping rules

Array-like types are coerced into NumPy arrays and normalized to shape:

  • vectors: always 1D (reshape(-1))
  • matrices:
    • scalar → (1, 1)
    • 1D → (1, N)
    • 2D → unchanged
    • higher-D → flattened into rows (reshape(-1, last_dim))
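A brief sketch of these rules (variable names are hypothetical):

schema = {
  "variables": {
    "v": {"type": "float-vector"},
    "m": {"type": "float-matrix"},
  }
}

p = SchemaProjector(schema, {"v": "1 2 3", "m": "1 2 3"})
p.data["v"].shape   # (3,)   vectors are always 1D
p.data["m"].shape   # (1, 3) a 1D input becomes a single-row matrix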

Arrays, substitutions, and named tokens

For float-vector and float-matrix, numeric coercion supports expressions and header substitutions (see numeric coercion docs).

In addition, SchemaProjector supports a small set of named array tokens for certain array types:

  • I → identity matrix (3×3)

Tokens may be prefixed by an integer multiplier:

schema = {"variables": {"m": {"type": "float-matrix", "optional": False}}}

SchemaProjector(schema, {"m": "I"}).data["m"]    # identity
SchemaProjector(schema, {"m": "2I"}).data["m"]   # 2 × identity

Token matching is currently:

  • only for specific array types (float-matrix, float-vector, int-matrix, int-vector, fraction-matrix, fraction-vector)
  • only when the raw value is a string
  • based on a leading integer multiplier + token name (e.g. "12I")

At present, I is the only documented token; additional tokens (for example, zeros or ones) may be added in future releases.


Supplying additional choices at runtime

The constructor accepts an optional variable_choices mapping:

p = SchemaProjector(schema, raw, variable_choices={"mode": ["debug", "safe"]})

This extends (or adds) the schema choices for the specified variables. If variable_choices references an unknown variable name, DSUnknownNameError is raised (with suggestions).

This is useful when a higher-level tool wants to allow extra modes without editing the schema file.


Accessing results

SchemaProjector provides:

  • data — typed output (dict)
  • raw_data — raw input after defaults are applied (dict)
  • data_tuple — namedtuple view of data
  • schema_variables — resolved variable definitions for the chosen schema (merged with inherited variables)
  • schema_name — resolved schema name
  • conflicts — merged conflict lists (including inherited conflicts)

String renderings (YAML-style):

  • raw_data_string — render_mapping(self.raw_data)
  • data_string — render_mapping(self.data)
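For example, a small sketch:

schema = {"variables": {"a": {"type": "int"}, "b": {"type": "float"}}}
p = SchemaProjector(schema, {"a": "1", "b": "2.5"})

p.data                 # {'a': 1, 'b': 2.5}
p.data_tuple.a         # 1, via the namedtuple view of the typed data
print(p.data_string)   # YAML-style rendering via render_mapping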

Aliasing and object construction

A variable may define code_alias, which renames it when instantiating a target class:

class Foo:
  def __init__(self, x):
    self.x = x

schema = {
  "variables": {
    "a": {"type": "int", "optional": False, "code_alias": "x"},
  }
}

p = SchemaProjector(schema, {"a": "5"})
obj = p.get_instance(Foo)

assert obj.x == 5

How get_instance works (current behavior)

  • The projector builds aliased_data: values are renamed according to code_alias where present.
  • get_instance(cls) inspects cls.__init__ and passes only those keys that match constructor parameter names.
  • Extra data keys are ignored for instantiation purposes (but unknown input keys are still rejected during projection).

This makes it easy to use the same schema both for validation and for building an object without manually wiring parameter names.


Extending domains

You can extend supported types and array tokens by creating a derived projector class:

Projector2 = SchemaProjector.domain_definitions(
  type_definitions={"double": lambda x: int(x) * 2},
  array_by_name={"Z": "0 0 0 ; 0 0 0 ; 0 0 0"},
)

schema = {"variables": {"a": {"type": "double", "optional": False}}}
p = Projector2(schema, {"a": "4"})
assert p.data["a"] == 8

domain_definitions(...) returns a subclass whose definitions extend the built-ins. If a name collides, the new definition overrides the old one.

2.4 - SchemaCommandLine

SchemaCommandLine is DataSchemer’s schema-driven command-line engine.

It combines:

  • argparse parsing (including optional argcomplete tab-completion)
  • YAML input files (optional)
  • an optional required positional file (per-command metadata)
  • SchemaProjector validation and typing
  • optional object instantiation (target_class)
  • and an optional print-attribute facility for introspecting results

Where SchemaProjector is the typing/validation core, SchemaCommandLine is the CLI orchestration layer.


High-level flow

When you construct a SchemaCommandLine, it performs all work immediately:

  1. Resolve schema form/name (normalize_schema_inputs)
  2. Resolve print_attribute configuration (with inheritance)
  3. Build schema variables (including inheritance/copy/update/delete)
  4. Build an argparse parser from schema variables
  5. Parse CLI args (strictly; allow_abbrev=False)
  6. Optionally load YAML input files
  7. Optionally read a required positional file (and optionally parse it as YAML)
  8. Merge all data sources into a single raw input mapping
  9. Validate and type the data via SchemaProjector
  10. Optionally instantiate target_class
  11. Run _operations() (override point for custom logic)
  12. Optionally print attributes requested via --print-attribute

This “do everything in __init__” approach keeps the class simple to embed: construct it once and you either get a successful run or a structured user error.


Basic example

The typical usage pattern is to subclass and set a few class attributes:

from data_schemer.schema_command_line import SchemaCommandLine

class MyCLI(SchemaCommandLine):
  pass

schema = {
  "variables": {
    "a": {"type": "int"},
    "b": {"type": "float"},
  }
}

# Equivalent of: MyCLI(schema_definitions=schema, argv=[...])
MyCLI(schema, argv=["--a", "1", "--b", "2.5"])

In practice, projects usually supply schema_definitions loaded from YAML, and set a target_class so a typed object is available in _operations().


Entry point: main()

SchemaCommandLine.main(..., debug: bool = False) -> int

main() is a convenience wrapper that:

  • returns a process exit code (0 on success)
  • catches DSUserError and prints a friendly message to stderr
  • suppresses tracebacks unless debug=True

This is the recommended entry point for console_scripts:

import sys

if __name__ == "__main__":
  raise SystemExit(MyCLI.main(schema, debug="--debug" in sys.argv))

Schema-driven argparse

SchemaCommandLine builds an argparse parser from the resolved schema variables using:

schema_to_argparse(schema_vars, description=..., masquerade_usage=True)

Key behaviors

  • Strict parsing: allow_abbrev=False to avoid ambiguous/accidental option matches.
  • Required handled by the projector: argparse does not enforce required arguments; the projector does.
  • Schema is the contract: help, metavar, aliases, choices, bool flags are schema-driven.
  • Improved UX: required variables get a (required) marker in colored help (TTY only).

Variable → CLI option mapping

A schema variable named foo_bar becomes:

  • --foo-bar

Aliases may be provided via alias:

  • if an alias starts with -, it is used as-is (short option)
  • otherwise it becomes --alias-name

Example:

variables:
  iterations:
    type: int
    alias: ["-n", "num_iterations"]

This yields options:

  • -n
  • --num-iterations
  • --iterations

Data sources and precedence

SchemaCommandLine can merge values from up to three sources:

  1. YAML input files (optional, via an “input files” variable in the schema)
  2. required_file (optional positional argument configured by schema metadata)
  3. CLI options (always available)

Precedence is:

  1. YAML input files (lowest)
  2. required_file schema data (middle)
  3. CLI options (highest)

This is implemented in _merge_sources():

raw_data = dict(self._input_file_data)
raw_data.update(self._required_file_schema_data)
raw_data.update(self._cli_data)

So a user can keep defaults in YAML and override with CLI flags.


YAML input files

YAML input files are enabled by defining a schema variable for them (the variable name is controlled by the input_files_tag constructor argument, which defaults to "input_files"):

  • variable must exist in schema vars
  • value is expected to be a list of file paths (often via string-list)
  • each YAML file must parse to a mapping/dict
  • merged sequentially (later files override earlier)

stdin can be used by passing "-" as a YAML input file path (meaning read YAML from stdin).

Important restriction: you cannot use "-" for both required_file and YAML input files in the same invocation.
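A minimal sketch (hypothetical file and schema names), assuming the schema declares the default input_files variable as a string-list:

schema = {
  "variables": {
    "input_files": {"type": "string-list", "optional": True,
                    "help": "YAML files providing default values"},
    "scale": {"type": "float", "optional": True, "default": 1.0},
  }
}

# settings.yaml contains:
#   scale: 2.0
#
# $ python mycli.py --input-files settings.yaml --scale 4.8
# The CLI value (4.8) overrides the YAML value (2.0), per the precedence rules above.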


required_file metadata

A command may declare a required positional file via schema metadata:

required_file:
  enabled: true
  metavar: FILE
  help: Required input file (use '-' for stdin).
  apply_as_schema_data: false
  read_mode: path      # path|text|binary
  stdin_ok: true

Behavior

  • If enabled: true, argparse adds a positional argument required_file.
  • How the file is consumed depends on read_mode:
    • path: do not read content (only store the path); required_file_content is None
    • text: read text content into required_file_content
    • binary: read bytes into required_file_content

apply_as_schema_data

If apply_as_schema_data: true, the required file is always read as text and parsed as YAML. The resulting mapping is merged into schema input data at the “required_file” precedence level.

This is useful for commands where the primary input is a YAML block but you want it as a required positional rather than --input-file.

Accessors

  • required_file_path
  • required_file_content
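A small sketch of using these accessors inside _operations() (assuming required_file is enabled in the schema metadata):

from data_schemer.schema_command_line import SchemaCommandLine

class MyCLI(SchemaCommandLine):
  def _operations(self):
    print(self.required_file_path)              # the positional path as given
    if self.required_file_content is not None:  # None when read_mode is "path"
      print(len(self.required_file_content))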

Unknown options and suggestions

SchemaCommandLine parses with parse_known_args() to detect unknown options and provide better errors.

  • Unknown options beginning with - raise DSUnknownNameError with “did you mean” suggestions.
  • Unexpected positional arguments raise DSUserError.

Suggestions are computed using close matches over normalized option names (including aliases).

This is a major UX improvement over default argparse errors.


target_class and object instantiation

If target_class is provided, SchemaCommandLine will:

  1. validate/type inputs using SchemaProjector
  2. call SchemaProjector.get_instance(target_class)
  3. expose the instantiated object via target_obj

Constructor kwargs are filtered by the target class __init__ signature (unknown kwargs ignored).

This makes schemas usable both for:

  • configuration validation
  • object construction
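A minimal sketch (assuming target_class is supplied as a class attribute on the subclass, consistent with the subclassing pattern shown earlier):

from data_schemer.schema_command_line import SchemaCommandLine

class Simulation:
  def __init__(self, a, b):
    self.a = a
    self.b = b

class SimCLI(SchemaCommandLine):
  target_class = Simulation

schema = {"variables": {"a": {"type": "int"}, "b": {"type": "float"}}}
cli = SimCLI(schema, argv=["--a", "1", "--b", "2.5"])
cli.target_obj.a   # 1, built via SchemaProjector.get_instance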

Extending behavior: _operations()

Override _operations() in subclasses to implement custom logic.

At the time _operations() runs:

  • self.data is available (typed dict)
  • self.target_obj may be available (if target_class provided)
  • required_file state is available

A common pattern is:

  • compute derived quantities
  • write output files
  • call domain-specific libraries
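A minimal sketch of overriding _operations() (the computation itself is hypothetical):

from data_schemer.schema_command_line import SchemaCommandLine

class MyCLI(SchemaCommandLine):
  def _operations(self):
    # typed data is available here; self.target_obj too, if target_class was set
    total = self.data["a"] + self.data["b"]
    print(f"a + b = {total}")

schema = {"variables": {"a": {"type": "int"}, "b": {"type": "float"}}}
MyCLI(schema, argv=["--a", "1", "--b", "2.5"])   # prints: a + b = 3.5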

print_attribute

print_attribute is a special schema feature that enables a built-in CLI option:

  • --print-attribute ...

This option lets users request printing one or more @property attributes from the instantiated target object.

It is designed for:

  • interactive exploration
  • debugging / inspection
  • reproducible output (printed in YAML-style using render_mapping)

Requirements

  • print-attribute requires a target_class
  • print-attribute is enabled and configured by schema descriptor configuration (not per-variable)

When enabled, SchemaCommandLine injects a reserved variable into the schema:

  • variable name: print_attribute (reserved; users must not define it)
  • type: string-list
  • optional: true

This injection ensures argparse sees a --print-attribute option without requiring you to bake it into every schema by hand.

Where configuration comes from

SchemaCommandLine resolves print-attribute configuration via:

print_attribute = get_schema_print_attribute(schema_name, schema_definitions)

This supports inheritance and merging rules defined by the schema resolver.

The resolved value may be:

  • False / None: disabled
  • True: enabled, auto mode
  • a dict: enabled, with advanced configuration

After resolution, SchemaCommandLine normalizes the configuration into one of two modes:

Auto mode

Auto mode derives choices by introspecting the target class:

  • all public @property names (including inherited)
  • excluding private names starting with _

Those internal property names are then mapped to external names presented on the CLI.
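For example, given a hypothetical target class:

class Crystal:
  @property
  def energy(self): ...
  @property
  def volume(self): ...
  @property
  def _cache(self): ...   # leading underscore: excluded from auto mode

# auto mode would derive the candidates "energy" and "volume"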

Manual mode

Manual mode uses an explicit list of external choices provided by schema configuration:

  • no class introspection is used to produce the list
  • values are still mapped to internal attribute names before reading from the object

Manual mode is selected when the resolved configuration dict contains a choices key and it is not null.


External vs internal names

This is the most important conceptual point:

  • The CLI accepts and displays external names only.
  • The underlying object is accessed using internal attribute names.

DataSchemer supports two mapping systems that may both apply:

  1. Schema variable code_alias (general DataSchemer feature)
  2. print_attribute.alias (specific to print-attribute)

Both contribute to mapping external → internal attribute names for print-attribute.

Schema code_alias interaction

Schema variable definitions may rename how constructor arguments map to the object:

variables:
  a:
    type: int
    code_alias: x

For print-attribute, SchemaCommandLine builds alias maps from schema vars:

  • internal → external
  • external → internal

and merges them with print-attribute’s explicit alias mapping (below).

In dict form, print-attribute may define:

print_attribute:
  alias:
    external_name: internal_name

This affects only print-attribute resolution (not object construction).

Disjointness requirement

These two alias maps must be disjoint (no overlapping keys), otherwise it becomes ambiguous. The implementation validates disjointness using merge_disjoint_maps(...) and raises DSUserError if they overlap.


Choices

In dict form, print-attribute can constrain and control the allowed --print-attribute values.

Manual choices

print_attribute:
  choices: ["energy", "volume"]

This:

  • forces manual mode
  • restricts CLI candidates strictly to those external spellings
  • drives tab-completion candidates (when argcomplete is installed)

If choices: null, then:

  • the configuration remains enabled
  • but it stays in auto mode
  • and (crucially) exclude is suppressed because the choices key is present (see exclude semantics below)

This “choices key present” rule is intentional: it lets inheritance signal “explicit configuration” even when the final list is not specified.

Runtime injection of choices for argparse

To support argparse choices and tab-completion, SchemaCommandLine dynamically injects the computed choices into the injected print_attribute variable only at parser construction time.


Exclude

In dict form, print-attribute may define an exclude list:

print_attribute:
  exclude: ["debug", "internal_state"]

Exclude is applied only in true auto mode, meaning:

  • the effective dict does not contain a choices key (including inherited configs)

If choices is present (even as null), exclude is ignored by design.

Exclude works on external spellings only:

  • if you exclude "foo", it will remove both the internal and external names that match "foo" from the auto-derived list

Consistency between choices and ext2int

In auto mode, exclude removes entries from:

  • the completion/validation choices list
  • the external→internal mapping (ext2int)

This prevents excluded names from reappearing via alias mappings.


Inheritance and merging

Because schema-level print_attribute is inherited and merged, treat it like a small “policy” object:

  • A base schema can enable print-attribute broadly.
  • A derived schema can:
    • add aliases
    • extend choices
    • add exclusions (in auto mode)
    • or disable print-attribute entirely (false)

The resolver and normalizer together ensure that:

  • the CLI only ever exposes external spellings
  • printing uses render_mapping({external_name: value})
  • and mapping is deterministic.

How print-attribute output looks

When --print-attribute is used, SchemaCommandLine prints each requested attribute in YAML style:

energy: -12.345
volume: 60.0

Internally, it prints one mapping per attribute, so output remains stream-friendly.

If the external name cannot be resolved to an attribute on the object, a DSUserError is raised with details (including internal name).


Tab completion with argcomplete

If argcomplete is installed, SchemaCommandLine enables completion via:

argcomplete.autocomplete(parser)

For --print-attribute, completion candidates are the computed external choices.

The canonical helper for computing them is:

  • compute_print_attribute_choices(...)

Downstream tools (such as Principia Materia) should call this helper rather than duplicating the logic.


Summary

SchemaCommandLine provides a practical, batteries-included way to build CLIs from schemas:

  • schema → argparse options (+ help/aliases/choices)
  • strict parsing + helpful suggestions
  • merge YAML + required_file + CLI with clear precedence
  • validate/type using SchemaProjector
  • optionally instantiate an object
  • optionally print introspected attributes using a carefully designed external/internal mapping system

The print-attribute feature is intentionally rich because it lives at the boundary between schema naming, object naming, and user-visible CLI affordances.

3 - Reference

Sphinx-generated documentation from the package docstrings is inserted here.