Documentation

DataSchemer (DS) is a lightweight, schema-driven library for parsing, validating, and typing structured user input into clean, well-defined Python data structures.

DS is designed to sit at the boundary between user-facing interfaces (command-line tools, configuration files, simple APIs) and internal application logic. Schemas describe what data is expected and how it should be interpreted; DS handles type coercion, validation, defaults, and structural mapping in a consistent, reusable way. DS fits naturally in a simple script that needs user input as well as in a substantial software suite with a deep command-line interface.

Key ideas include:

  • Declarative schemas expressed as plain Python dictionaries, with support for reuse and extension via inheritance.
  • Strict, predictable parsing, with optional support for simple mathematical expressions (basic arithmetic), evaluated explicitly and safely.
  • First-class CLI integration, including help text, validation, and tab completion.
  • Minimal assumptions, making the core a useful utility independent of CLI applications.

DataSchemer serves as the backbone for user input in the Principia Materia software suite, but DS is a standalone utility that should be useful in other applications.

Citation

If DataSchemer is used in published academic work, please cite it. A recommended citation is provided in CITATION.cff (see repository).

1 - Getting Started

DataSchemer lets you define the shape of your input once, and then rely on it to parse, validate, and structure user input consistently across your application.

At its core, DataSchemer uses schemas to describe what input is expected and how it should be interpreted. From that description, it handles type coercion, validation, defaults, and structured results for you.

A first example

Suppose your program expects a set of lattice vectors and an optional scale factor. You would like to accept human-friendly input, but work internally with numeric data.

A simple example

from data_schemer.schema_projector import SchemaProjector

schema = {
  "variables": {
    "lattice_vectors": {
      "type": "float-matrix",
      "help": "3×3 lattice vectors",
    },
    "scale": {
      "type": "float",
      "optional": True,
      "default": 1.0,
      "help": "Scale factor applied to the lattice vectors",
    },
  }
}

input_data = {
  "lattice_vectors": """
    0 1 1
    1 0 1
    1 1 0
  """,
  "scale": "4.8",
}

result = SchemaProjector(schema, input_data)

print(result.data)

The output is

{
  'lattice_vectors': array([[0., 1., 1.],
                            [1., 0., 1.],
                            [1., 1., 0.]]),
  'scale': 4.8
}

Basic substitution and arithmetic are safely supported. A hexagonal lattice may be entered as

input_data_tri = {
  "lattice_vectors": """\
    a=2.23  c=5.8
    a*r3/2  a/2  0
    a*r3/2 -a/2  0
    0       0    c
  """,
  "scale": "1.0",
}

result = SchemaProjector(schema, input_data_tri)

print(result.data["lattice_vectors"])

yielding

[[ 1.93123665  1.115       0.        ]
 [ 1.93123665 -1.115       0.        ]
 [ 0.          0.          5.8       ]]

A command line interface can automatically be constructed from the schema as follows

# test.py
from data_schemer.schema_command_line import SchemaCommandLine 
scl = SchemaCommandLine(schema)

print(scl.data['lattice_vectors'])

Executing on the command line gives

$ python test.py --lattice-vectors '0 1/2 1/2 ; 1/2 0 1/2 ; 1/2 1/2 0'
[[0.  0.5 0.5]
 [0.5 0.  0.5]
 [0.5 0.5 0. ]]

where the text string is parsed using a hierarchy of delimiters. Help menus are automatically generated from the schema information.

$ python test.py -h

usage: test.py [-h] [--lattice-vectors X] [--scale X]

options:
  -h, --help
                        show this help message and exit
  --lattice-vectors X
                        (required) 3×3 lattice vectors
  --scale X    Scale factor applied to the lattice vectors

1.1 - Install

DataSchemer can be installed using pip. Very soon, you will be able to install from PyPI using

pip install data-schemer

Until then, clone the repo https://github.com/marianettigroup/data-schemer . Then install using pip, preferably in editable mode.

pip install -e .

2 - Guides

Practical, copy-paste usage examples for DataSchemer.

2.1 - Coercing numbers

DataSchemer is designed to accept human-friendly numeric input and reliably convert it into structured Python data. This is especially important for command-line tools, configuration files, and lightweight data formats, where users naturally write numbers as text, expressions, or simple tables.

The numeric coercion utilities live in coerce_numbers.py and can be used directly, or indirectly through schema-driven workflows.


The core idea

The coercion utilities take loosely structured numeric text and turn it into:

  • Python scalars (int, float, Fraction, …)
  • Python lists of numbers
  • Nested lists representing matrices

The input may contain:

  • arithmetic expressions
  • symbolic substitutions
  • multiple values separated by whitespace or punctuation
  • simple matrix layouts

The goal is to let users write what they mean, without forcing rigid syntax or file formats.


Import

The most commonly used entry point is:

from data_schemer.coerce_numbers import coerce_number_list_with_substitutions

This function always returns Python lists, never NumPy arrays.
Results can be trivially converted to NumPy arrays, or one may import coerce_array_with_substitutions, which returns a NumPy array.
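For example, a small sketch of converting the result, using only the call documented above:

import numpy as np
from data_schemer.coerce_numbers import coerce_number_list_with_substitutions

rows = coerce_number_list_with_substitutions("1 0 ; 0 1", dtype=float)
matrix = np.array(rows)   # nested lists become a (2, 2) NumPy array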


Scalars: numbers as expressions

At the simplest level, a single number—written as text—is parsed and evaluated:

coerce_number_list_with_substitutions("3", dtype=int)
# -> [3]

Expressions are allowed, so users don’t need to precompute values:

coerce_number_list_with_substitutions("1/2", dtype=float)
# -> [0.5]

coerce_number_list_with_substitutions("3*4 + 1", dtype=int)
# -> [13]

Supported arithmetic includes:

  • addition and subtraction: +, -
  • multiplication and division: *, /
  • scientific notation: 1e-3, 2E+4
  • square roots using rN notation, for example:

coerce_number_list_with_substitutions("r2", dtype=float)
# -> [1.4142135623730951]

This is intentionally small but predictable: enough power for scientific input, without becoming a general-purpose programming language.


Multiple values: lists

Most real inputs contain more than one number. Values can be separated by:

  • spaces
  • commas
  • semicolons
  • newlines

All of the following are equivalent:

coerce_number_list_with_substitutions("1 2 3", dtype=int)
# -> [1, 2, 3]

coerce_number_list_with_substitutions("1,2,3", dtype=int)
# -> [1, 2, 3]

coerce_number_list_with_substitutions("1; 2; 3", dtype=int)
# -> [1, 2, 3]

This makes the parser forgiving and easy to use in CLI contexts.


Matrices: nested lists

When separators imply rows (such as newlines or semicolons), the result becomes a nested list:

coerce_number_list_with_substitutions("1,0;0,1", dtype=int)
# -> [[1, 0], [0, 1]]

Newlines work naturally:

text = """
1 0 0
0 1 0
0 0 1
"""

coerce_number_list_with_substitutions(text, dtype=int)
# -> [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

At this stage, the structure is purely Python lists—no NumPy assumptions are made.


Header substitutions: symbolic values

One of the most powerful features is header substitutions.

You can define symbols at the top of the input and reuse them below:

text = """
a=0.5 b=1/3
a, b
0.0, 2*a
"""

coerce_number_list_with_substitutions(text, dtype=float)
# -> [[0.5, 0.3333333333333333], [0.0, 1.0]]

How this works:

  1. The first line defines substitutions (a, b)
  2. These symbols are available in all subsequent expressions
  3. Expressions are evaluated after substitution

This is especially useful for:

  • lattice vectors
  • parameterized matrices
  • avoiding repeated numeric constants

Exact arithmetic with Fraction

By default, numeric expressions are evaluated using floating-point arithmetic. If you need exact rational values, you can request Fraction explicitly:

from fractions import Fraction

coerce_number_list_with_substitutions("1/3", dtype=Fraction)
# -> [Fraction(1, 3)]

This applies consistently to expressions and substitutions:

text = """
a=1/3
2*a
"""

coerce_number_list_with_substitutions(text, dtype=Fraction)
# -> [Fraction(2, 3)]

This is particularly useful in symbolic or group-theoretical contexts where exact ratios matter.


What this function guarantees

  • Output is always a Python list (possibly nested)
  • All numeric values are of the requested dtype
  • Expressions are evaluated deterministically
  • Substitutions are scoped to the input block
  • No NumPy dependency is introduced at this level

When to use numeric coercion directly

Use these utilities directly when you are:

  • parsing numeric text files
  • accepting numeric expressions from users
  • handling CLI arguments that represent vectors or matrices
  • preprocessing data before schema validation

In schema-driven workflows, you typically won’t call this yourself. SchemaProjector invokes numeric coercion automatically when a schema variable is declared as numeric.


Design philosophy

The numeric coercion layer is intentionally:

  • permissive in input syntax
  • strict in output structure
  • predictable and reproducible
  • free of side effects

It acts as a bridge between human-readable numeric text and strongly-typed Python data—without imposing unnecessary ceremony.

2.2 - Rendering data

DataSchemer includes small, focused utilities for rendering Python data into readable, YAML-compatible text. These tools are intended for human-facing output: command-line tools, reports, and snippets that users may want to copy directly into configuration files.

The rendering utilities live in render_data.py.

They are not a full serialization system. Instead, they provide:

  • stable, predictable formatting
  • readable alignment for numeric data
  • minimal YAML-compatible syntax
  • no implicit I/O or file handling

Design goals

The rendering layer is intentionally conservative.

It aims to:

  • produce output that is easy to read in a terminal
  • remain valid YAML when pasted into a file
  • avoid introducing YAML features that obscure structure
  • keep formatting decisions explicit and reproducible

For full serialization, schema validation, or round-tripping, standard YAML libraries should be used instead.


Overview of the API

The public rendering API consists of three main functions:

  • render_variable
  • render_array
  • render_mapping

These functions are designed to compose cleanly: render_variable dispatches to render_array or render_mapping as needed.


render_variable

render_variable(name, value, *, indent=0)

Render a single named variable and its value as YAML-style text.

This is the primary entry point used by DataSchemer command-line tools when emitting results.

Behavior

  • Prepends the variable name followed by :
  • Delegates rendering of the value based on its type
  • Applies indentation consistently
  • Produces YAML-compatible output

Example

from data_schemer.render_data import render_variable

render_variable("temperature", 300)

Output:

temperature: 300

Arrays and mappings are rendered on subsequent lines with indentation:

import numpy as np

render_variable("vector", np.array([1, 2, 3]))
vector:
  [ 1, 2, 3 ]

render_array

render_array(array, *, indent=0)

Render a one- or two-dimensional array as aligned, readable text.

This function accepts:

  • NumPy arrays
  • nested Python lists
  • array-like objects convertible to NumPy

Formatting rules

  • Arrays are rendered using bracket notation ([ ... ])
  • Rows are aligned vertically for readability
  • Floating-point values are trimmed of insignificant zeros
  • Negative zero is suppressed (-0.0 → 0)
  • Output is valid YAML (inline sequences)

Example: vector

import numpy as np
from data_schemer.render_data import render_array

render_array(np.array([0.5, 0.0, 1.0]))

Output:

[ 0.5, 0, 1 ]

Example: matrix

matrix = np.array([
  [0.5, 0.5, 0.5],
  [0.0, 0.0, 0.0],
  [0.5, 0.0, 0.0],
])

render_array(matrix, indent=2)

Output:

  [[ 0.5, 0.5, 0.5 ],
   [ 0  , 0  , 0   ],
   [ 0.5, 0  , 0   ]]

The formatting prioritizes visual alignment over compactness.


render_mapping

render_mapping(mapping, *, indent=0)

Render a mapping (typically a dictionary) as a YAML-style block.

Behavior

  • Keys are rendered in insertion order
  • Each key is rendered using render_variable
  • Nested mappings increase indentation
  • Values may be scalars, arrays, or other mappings

Example

from data_schemer.render_data import render_mapping
import numpy as np

data = {
  "positions": np.array([
    [0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0],
  ]),
  "lattice_constant": 3.905,
}

print(render_mapping(data))

Output:

positions:
  [[ 0.5, 0.5, 0.5 ],
   [ 0  , 0  , 0   ]]
lattice_constant: 3.905

YAML compatibility

The output produced by the rendering utilities is:

  • syntactically valid YAML
  • intentionally minimal
  • free of tags, anchors, or advanced YAML constructs

This makes it suitable for:

  • copy-paste into configuration files
  • inclusion in documentation
  • inspection in terminal output

However, the rendering layer does not guarantee round-tripping back to the original Python object.


When to use render_data

Use the rendering utilities when you want:

  • human-readable numeric output
  • stable formatting for CLI tools
  • YAML-compatible snippets without full serialization
  • consistent array formatting across tools

Do not use them when you need:

  • schema enforcement
  • automatic file writing
  • full YAML feature support
  • guaranteed round-trip fidelity

Relationship to SchemaCommandLine

In schema-driven workflows, these functions are typically invoked indirectly.

SchemaCommandLine uses render_variable to emit results in a consistent, user-facing format, ensuring that command output is both readable and reusable.


Summary

The rendering layer is intentionally small:

  • render_variable handles named output
  • render_array handles numeric structure
  • render_mapping handles hierarchical data

Together, they provide a predictable bridge between Python data structures and human-readable, YAML-compatible text.

2.3 - SchemaProjector

SchemaProjector is the core validation and typing engine of DataSchemer.

It takes:

  • a schema definition (a dictionary)
  • raw input data (a dictionary, often containing strings or loosely typed values)

and produces validated, typed Python data.

Validation happens during construction: if anything is missing, unknown, or invalid, SchemaProjector(...) raises a DataSchemer user-facing exception.

This page documents the behavior that exists in the current implementation.


Basic example

from data_schemer.schema_projector import SchemaProjector

schema = {
  "variables": {
    "a": {"type": "int", "optional": False},
    "b": {"type": "float", "optional": False},
  }
}

raw = {"a": "1", "b": "2.5"}
p = SchemaProjector(schema, raw)

assert p.data["a"] == 1
assert p.data["b"] == 2.5

Validation model

Required vs optional

In DataSchemer, variables are required by default.

  • Required variables must be present in the input.
  • Optional variables are explicitly marked with optional: true.
  • A default may only be specified for optional variables. (If a required variable defines a default, schema loading raises an error.)

Defaults are applied before validation:

  • If an optional variable has a default and the user did not provide a value, the default is injected into raw_data.
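For example, a minimal sketch (the variable name is hypothetical) of a default being injected:

schema = {
  "variables": {
    "tolerance": {"type": "float", "optional": True, "default": 1e-6},
  }
}

p = SchemaProjector(schema, {})          # user provided no value
assert p.raw_data["tolerance"] == 1e-6   # default injected into raw_data
assert p.data["tolerance"] == 1e-6       # then coerced and validated as usual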

Unknown keys are rejected

SchemaProjector is strict: unknown input keys are an error.

If input_data contains a key that is not declared in the schema’s variables, construction fails with DSUnknownNameError (including close-match suggestions).

This strictness is intentional: it catches typos early and keeps schemas authoritative.

Additional schema restrictions

SchemaProjector enforces:

  • requires: a variable may require one or more other variables to be present
  • conflicts: sets of variables where at most one may be defined
  • choices: restrict allowed values (including support for string-list)
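A minimal sketch of these restrictions (variable names are hypothetical, and declaring conflicts at the schema level is an assumption based on the merged conflicts attribute described later on this page):

schema = {
  "variables": {
    "mode": {
      "type": "string",
      "optional": True,
      "choices": ["fast", "accurate"],     # restrict allowed values
    },
    "smearing": {
      "type": "string",
      "optional": True,
      "requires": ["smearing_width"],      # must be accompanied by smearing_width
    },
    "smearing_width": {"type": "float", "optional": True},
    "kpoint_mesh": {"type": "int-vector", "optional": True},
    "kpoint_density": {"type": "float", "optional": True},
  },
  # at most one of kpoint_mesh / kpoint_density may be supplied
  "conflicts": [["kpoint_mesh", "kpoint_density"]],
}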

Errors and what to catch

SchemaProjector raises user-facing DataSchemer exceptions from data_schemer.errors, including:

  • DSMissingRequiredError — required variables were not provided
  • DSUnknownNameError — schema name not found (in multi-schema form) or input contains unknown variable(s)
  • DSInvalidChoiceError — an input value is not in a variable’s choices
  • DSCoercionError — type coercion failed (includes variable name, raw value, expected type, and details)
  • DSUserError — general user-facing schema errors (e.g., conflicts, requires)

When embedding DataSchemer, it is usually sufficient to catch DSUserError (the base class), unless you want custom handling per subtype.
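For example, a minimal sketch of embedding (the exception classes are importable from data_schemer.errors, as listed above):

from data_schemer.errors import DSUserError
from data_schemer.schema_projector import SchemaProjector

schema = {"variables": {"a": {"type": "int"}}}

try:
    p = SchemaProjector(schema, {"a": "1", "b": "2"})   # "b" is not declared in the schema
except DSUserError as exc:   # base class; covers unknown names, coercion failures, etc.
    print(f"error: {exc}")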


Schema forms: single vs multi-schema

Single schema (no name)

If the schema dictionary contains a top-level variables key, it is treated as a single schema and given the default name "default" (unless you pass an explicit name).

schema = {
  "variables": {
    "x": {"type": "int", "optional": False},
  }
}

p = SchemaProjector(schema, {"x": "3"})
assert p.schema_name == "default"

Multiple schemas with inheritance

A schema dictionary may contain multiple named schemas. A derived schema may list bases in inherit.

schema = {
  "base": {
    "variables": {
      "a": {"type": "int", "optional": False},
    }
  },
  "derived": {
    "inherit": ["base"],
    "variables": {
      "mode": {
        "type": "string",
        "optional": False,
        "choices": ["fast", "accurate"]
      }
    }
  }
}

p = SchemaProjector(schema, {"a": "2", "mode": "fast"}, "derived")
assert p.data["a"] == 2
assert p.data["mode"] == "fast"

Inheritance semantics (current behavior)

  • Base schemas are applied first; child schema variables override base variables of the same name.
  • Inheritance is recursive: bases may themselves inherit from other schemas.
  • If schema_definitions contains multiple schemas, schema_name is required; omitting it raises DSUserError.
  • If schema_name does not exist, DSUnknownNameError is raised with suggested schema names.

Note: inherit is treated as an ordered list, but no explicit precedence rule is defined for the case where multiple bases define the same variable name; in practice, the recursive update order determines the winner. If you rely on multiple inheritance, avoid ambiguous overlaps between bases.


Variable definitions

A schema’s variables mapping associates each variable name with a definition dictionary. Common keys include:

  • type (required; see below)
  • optional (default: False)
  • default (optional variables only)
  • choices (list of allowed values)
  • requires (list of required companion variables)
  • code_alias (rename when constructing objects; see below)
  • help (description of the variable, used in generated help text and documentation)
  • metavar (placeholder string shown in generated usage/help text)

Variable copy / update / delete / conflict (advanced)

In multi-schema definitions, the current implementation supports schema-level variable transforms:

  • copy_variables: define a new variable by copying an existing variable from another schema using "schema@var" syntax
  • update_variables: patch selected fields of existing variable definitions
  • delete_variables: remove variables from the final merged set
  • conflicts: sets of variables where at most one may be defined

These features are powerful, but they also introduce complexity; prefer plain variable definitions unless a schema-level transform genuinely simplifies your schemas.


Supported types

Built-in types are defined in SchemaProjector._type_definitions. The most commonly used are:

Scalars:

  • int
  • float
  • fraction (fractions.Fraction)
  • bool (accepts True/False or "true"/"false", case-insensitive)
  • string
  • none (identity)

Structured:

  • string-list (string → [string], list → list of strings)
  • dict-string-to-float-matrix
  • dict-string-to-float-vector

Numeric arrays:

  • float-vector, float-matrix (support expressions + substitutions)
  • int-vector, int-matrix
  • fraction-vector, fraction-matrix
  • string-vector, string-matrix

Vector/matrix shaping rules

Array-like types are coerced into NumPy arrays and normalized to shape:

  • vectors: always 1D (reshape(-1))
  • matrices:
    • scalar → (1, 1)
    • 1D → (1, N)
    • 2D → unchanged
    • higher-D → flattened into rows (reshape(-1, last_dim))
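A brief sketch of these rules (variable names are hypothetical):

schema = {
  "variables": {
    "v": {"type": "float-vector"},
    "m": {"type": "float-matrix"},
  }
}

p = SchemaProjector(schema, {"v": "1 2 3", "m": "1 2 3"})
p.data["v"].shape   # (3,)   vectors are always 1D
p.data["m"].shape   # (1, 3) a 1D input becomes a single-row matrix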

Arrays, substitutions, and named tokens

For float-vector and float-matrix, numeric coercion supports expressions and header substitutions (see numeric coercion docs).

In addition, SchemaProjector supports a small set of named array tokens for certain array types:

  • I → identity matrix (3×3)

Tokens may be prefixed by an integer multiplier:

schema = {"variables": {"m": {"type": "float-matrix", "optional": False}}}

SchemaProjector(schema, {"m": "I"}).data["m"]    # identity
SchemaProjector(schema, {"m": "2I"}).data["m"]   # 2 × identity

Token matching is currently:

  • only for specific array types (float-matrix, float-vector, int-matrix, int-vector, fraction-matrix, fraction-vector)
  • only when the raw value is a string
  • based on a leading integer multiplier + token name (e.g. "12I")

At present, I is the only documented token; additional tokens (for example, zeros or ones) may be added in future releases.


Supplying additional choices at runtime

The constructor accepts an optional variable_choices mapping:

p = SchemaProjector(schema, raw, variable_choices={"mode": ["debug", "safe"]})

This extends (or adds) the schema choices for the specified variables. If variable_choices references an unknown variable name, DSUnknownNameError is raised (with suggestions).

This is useful when a higher-level tool wants to allow extra modes without editing the schema file.


Accessing results

SchemaProjector provides:

  • data — typed output (dict)
  • raw_data — raw input after defaults are applied (dict)
  • data_tuple — namedtuple view of data
  • schema_variables — resolved variable definitions for the chosen schema (merged with inherited variables)
  • schema_name — resolved schema name
  • conflicts — merged conflict lists (including inherited conflicts)

String renderings (YAML-style):

  • raw_data_string — render_mapping(self.raw_data)
  • data_string — render_mapping(self.data)
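For example, a small sketch:

schema = {"variables": {"a": {"type": "int"}, "b": {"type": "float"}}}
p = SchemaProjector(schema, {"a": "1", "b": "2.5"})

p.data                 # {'a': 1, 'b': 2.5}
p.data_tuple.a         # 1, via the namedtuple view of the typed data
print(p.data_string)   # YAML-style rendering via render_mapping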

Aliasing and object construction

A variable may define code_alias, which renames it when instantiating a target class:

class Foo:
  def __init__(self, x):
    self.x = x

schema = {
  "variables": {
    "a": {"type": "int", "optional": False, "code_alias": "x"},
  }
}

p = SchemaProjector(schema, {"a": "5"})
obj = p.get_instance(Foo)

assert obj.x == 5

How get_instance works (current behavior)

  • The projector builds aliased_data: values are renamed according to code_alias where present.
  • get_instance(cls) inspects cls.__init__ and passes only those keys that match constructor parameter names.
  • Extra data keys are ignored for instantiation purposes (but unknown input keys are still rejected during projection).

This makes it easy to use the same schema both for validation and for building an object without manually wiring parameter names.


Extending domains

You can extend supported types and array tokens by creating a derived projector class:

Projector2 = SchemaProjector.domain_definitions(
  type_definitions={"double": lambda x: int(x) * 2},
  array_by_name={"Z": "0 0 0 ; 0 0 0 ; 0 0 0"},
)

schema = {"variables": {"a": {"type": "double", "optional": False}}}
p = Projector2(schema, {"a": "4"})
assert p.data["a"] == 8

domain_definitions(...) returns a subclass whose definitions extend the built-ins. If a name collides, the new definition overrides the old one.

2.4 - SchemaCommandLine

SchemaCommandLine is DataSchemer’s schema-driven command-line engine.

It combines:

  • argparse parsing (including optional argcomplete tab-completion)
  • YAML input files (optional)
  • an optional required positional file (per-command metadata)
  • SchemaProjector validation and typing
  • optional object instantiation (target_class)
  • and an optional print-attribute facility for introspecting results

Where SchemaProjector is the typing/validation core, SchemaCommandLine is the CLI orchestration layer.


High-level flow

When you construct a SchemaCommandLine, it performs all work immediately:

  1. Resolve schema form/name (normalize_schema_inputs)
  2. Resolve print_attribute configuration (with inheritance)
  3. Build schema variables (including inheritance/copy/update/delete)
  4. Build an argparse parser from schema variables
  5. Parse CLI args (strictly; allow_abbrev=False)
  6. Optionally load YAML input files
  7. Optionally read a required positional file (and optionally parse it as YAML)
  8. Merge all data sources into a single raw input mapping
  9. Validate and type the data via SchemaProjector
  10. Optionally instantiate target_class
  11. Run _operations() (override point for custom logic)
  12. Optionally print attributes requested via --print-attribute

This “do everything in __init__” approach keeps the class simple to embed: construct it once and you either get a successful run or a structured user error.


Basic example

The typical usage pattern is to subclass and set a few class attributes:

from data_schemer.schema_command_line import SchemaCommandLine

class MyCLI(SchemaCommandLine):
  pass

schema = {
  "variables": {
    "a": {"type": "int"},
    "b": {"type": "float"},
  }
}

# Equivalent of: MyCLI(schema_definitions=schema, argv=[...])
MyCLI(schema, argv=["--a", "1", "--b", "2.5"])

In practice, projects usually supply schema_definitions loaded from YAML, and set a target_class so a typed object is available in _operations().


Entry point: main()

SchemaCommandLine.main(..., debug: bool = False) -> int

main() is a convenience wrapper that:

  • returns a process exit code (0 on success)
  • catches DSUserError and prints a friendly message to stderr
  • suppresses tracebacks unless debug=True

This is the recommended entry point for console_scripts:

import sys

if __name__ == "__main__":
  raise SystemExit(MyCLI.main(schema, debug="--debug" in sys.argv))

Schema-driven argparse

SchemaCommandLine builds an argparse parser from the resolved schema variables using:

schema_to_argparse(schema_vars, description=..., masquerade_usage=True)

Key behaviors

  • Strict parsing: allow_abbrev=False to avoid ambiguous/accidental option matches.
  • Required handled by the projector: argparse does not enforce required arguments; the projector does.
  • Schema is the contract: help, metavar, aliases, choices, bool flags are schema-driven.
  • Improved UX: required variables get a (required) marker in colored help (TTY only).

Variable → CLI option mapping

A schema variable named foo_bar becomes:

  • --foo-bar

Aliases may be provided via alias:

  • if an alias starts with -, it is used as-is (short option)
  • otherwise it becomes --alias-name

Example:

variables:
  iterations:
    type: int
    alias: ["-n", "num_iterations"]

This yields options:

  • -n
  • --num-iterations
  • --iterations

Data sources and precedence

SchemaCommandLine can merge values from up to three sources:

  1. YAML input files (optional, via an “input files” variable in the schema)
  2. required_file (optional positional argument configured by schema metadata)
  3. CLI options (always available)

Precedence is:

  1. YAML input files (lowest)
  2. required_file schema data (middle)
  3. CLI options (highest)

This is implemented in _merge_sources():

raw_data = dict(self._input_file_data)
raw_data.update(self._required_file_schema_data)
raw_data.update(self._cli_data)

So a user can keep defaults in YAML and override with CLI flags.


YAML input files

YAML input files are enabled by defining a schema variable for them (the variable name is controlled by the input_files_tag constructor argument, which defaults to "input_files"):

  • variable must exist in schema vars
  • value is expected to be a list of file paths (often via string-list)
  • each YAML file must parse to a mapping/dict
  • merged sequentially (later files override earlier)

stdin can be used by passing "-" as a YAML input file path (meaning read YAML from stdin).

Important restriction: you cannot use "-" for both required_file and YAML input files in the same invocation.
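A minimal sketch (hypothetical file and schema names), assuming the schema declares the default input_files variable as a string-list:

schema = {
  "variables": {
    "input_files": {"type": "string-list", "optional": True,
                    "help": "YAML files providing default values"},
    "scale": {"type": "float", "optional": True, "default": 1.0},
  }
}

# settings.yaml contains:
#   scale: 2.0
#
# $ python mycli.py --input-files settings.yaml --scale 4.8
# The CLI value (4.8) overrides the YAML value (2.0), per the precedence rules above.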


required_file metadata

A command may declare a required positional file via schema metadata:

required_file:
  enabled: true
  metavar: FILE
  help: Required input file (use '-' for stdin).
  apply_as_schema_data: false
  read_mode: path      # path|text|binary
  stdin_ok: true

Behavior

  • If enabled: true, argparse adds a positional argument required_file.
  • How the file is consumed depends on read_mode:
    • path: do not read content (only store the path); required_file_content is None
    • text: read text content into required_file_content
    • binary: read bytes into required_file_content

apply_as_schema_data

If apply_as_schema_data: true, the required file is always read as text and parsed as YAML. The resulting mapping is merged into schema input data at the “required_file” precedence level.

This is useful for commands where the primary input is a YAML block but you want it as a required positional rather than --input-file.

Accessors

  • required_file_path
  • required_file_content
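A small sketch of using these accessors inside _operations() (assuming required_file is enabled in the schema metadata):

from data_schemer.schema_command_line import SchemaCommandLine

class MyCLI(SchemaCommandLine):
  def _operations(self):
    print(self.required_file_path)              # the positional path as given
    if self.required_file_content is not None:  # None when read_mode is "path"
      print(len(self.required_file_content))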

Unknown options and suggestions

SchemaCommandLine parses with parse_known_args() to detect unknown options and provide better errors.

  • Unknown options beginning with - raise DSUnknownNameError with “did you mean” suggestions.
  • Unexpected positional arguments raise DSUserError.

Suggestions are computed using close matches over normalized option names (including aliases).

This is a major UX improvement over default argparse errors.


target_class and object instantiation

If target_class is provided, SchemaCommandLine will:

  1. validate/type inputs using SchemaProjector
  2. call SchemaProjector.get_instance(target_class)
  3. expose the instantiated object via target_obj

Constructor kwargs are filtered by the target class __init__ signature (unknown kwargs ignored).

This makes schemas usable both for:

  • configuration validation
  • object construction
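A minimal sketch (assuming target_class is supplied as a class attribute on the subclass, consistent with the subclassing pattern shown earlier):

from data_schemer.schema_command_line import SchemaCommandLine

class Simulation:
  def __init__(self, a, b):
    self.a = a
    self.b = b

class SimCLI(SchemaCommandLine):
  target_class = Simulation

schema = {"variables": {"a": {"type": "int"}, "b": {"type": "float"}}}
cli = SimCLI(schema, argv=["--a", "1", "--b", "2.5"])
cli.target_obj.a   # 1, built via SchemaProjector.get_instance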

Extending behavior: _operations()

Override _operations() in subclasses to implement custom logic.

At the time _operations() runs:

  • self.data is available (typed dict)
  • self.target_obj may be available (if target_class provided)
  • required_file state is available

A common pattern is:

  • compute derived quantities
  • write output files
  • call domain-specific libraries
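A minimal sketch of overriding _operations() (the computation itself is hypothetical):

from data_schemer.schema_command_line import SchemaCommandLine

class MyCLI(SchemaCommandLine):
  def _operations(self):
    # typed data is available here; self.target_obj too, if target_class was set
    total = self.data["a"] + self.data["b"]
    print(f"a + b = {total}")

schema = {"variables": {"a": {"type": "int"}, "b": {"type": "float"}}}
MyCLI(schema, argv=["--a", "1", "--b", "2.5"])   # prints: a + b = 3.5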

print_attribute

print_attribute is a special schema feature that enables a built-in CLI option:

  • --print-attribute ...

This option lets users request printing one or more @property attributes from the instantiated target object.

It is designed for:

  • interactive exploration
  • debugging / inspection
  • reproducible output (printed in YAML-style using render_mapping)

Requirements

  • print-attribute requires a target_class
  • print-attribute is enabled and configured by schema descriptor configuration (not per-variable)

When enabled, SchemaCommandLine injects a reserved variable into the schema:

  • variable name: print_attribute (reserved; users must not define it)
  • type: string-list
  • optional: true

This injection ensures argparse sees a --print-attribute option without requiring you to bake it into every schema by hand.

Where configuration comes from

SchemaCommandLine resolves print-attribute configuration via:

print_attribute = get_schema_print_attribute(schema_name, schema_definitions)

This supports inheritance and merging rules defined by the schema resolver.

The resolved value may be:

  • False / None: disabled
  • True: enabled, auto mode
  • a dict: enabled, with advanced configuration

After resolution, SchemaCommandLine normalizes the configuration into one of two modes:

Auto mode

Auto mode derives choices by introspecting the target class:

  • all public @property names (including inherited)
  • excluding private names starting with _

Those internal property names are then mapped to external names presented on the CLI.
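For example, given a hypothetical target class:

class Crystal:
  @property
  def energy(self): ...
  @property
  def volume(self): ...
  @property
  def _cache(self): ...   # leading underscore: excluded from auto mode

# auto mode would derive the candidates "energy" and "volume"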

Manual mode

Manual mode uses an explicit list of external choices provided by schema configuration:

  • no class introspection is used to produce the list
  • values are still mapped to internal attribute names before reading from the object

Manual mode is selected when the resolved configuration dict contains a choices key and it is not null.


External vs internal names

This is the most important conceptual point:

  • The CLI accepts and displays external names only.
  • The underlying object is accessed using internal attribute names.

DataSchemer supports two mapping systems that may both apply:

  1. Schema variable code_alias (general DataSchemer feature)
  2. print_attribute.alias (specific to print-attribute)

Both contribute to mapping external → internal attribute names for print-attribute.

Schema code_alias interaction

Schema variable definitions may rename how constructor arguments map to the object:

variables:
  a:
    type: int
    code_alias: x

For print-attribute, SchemaCommandLine builds alias maps from schema vars:

  • internal → external
  • external → internal

and merges them with print-attribute’s explicit alias mapping (below).

In dict form, print-attribute may define:

print_attribute:
  alias:
    external_name: internal_name

This affects only print-attribute resolution (not object construction).

Disjointness requirement

These two alias maps must be disjoint (no overlapping keys), otherwise it becomes ambiguous. The implementation validates disjointness using merge_disjoint_maps(...) and raises DSUserError if they overlap.


Choices

In dict form, print-attribute can constrain and control the allowed --print-attribute values.

Manual choices

print_attribute:
  choices: ["energy", "volume"]

This:

  • forces manual mode
  • restricts CLI candidates strictly to those external spellings
  • drives tab-completion candidates (when argcomplete is installed)

If choices: null, then:

  • the configuration remains enabled
  • but it stays in auto mode
  • and (crucially) exclude is suppressed because the choices key is present (see exclude semantics below)

This “choices key present” rule is intentional: it lets inheritance signal “explicit configuration” even when the final list is not specified.

Runtime injection of choices for argparse

To support argparse choices and tab-completion, SchemaCommandLine dynamically injects the computed choices into the injected print_attribute variable only at parser construction time.


Exclude

In dict form, print-attribute may define an exclude list:

print_attribute:
  exclude: ["debug", "internal_state"]

Exclude is applied only in true auto mode, meaning:

  • the effective dict does not contain a choices key (including inherited configs)

If choices is present (even as null), exclude is ignored by design.

Exclude works on external spellings only:

  • if you exclude "foo", it will remove both the internal and external names that match "foo" from the auto-derived list

Consistency between choices and ext2int

In auto mode, exclude removes entries from:

  • the completion/validation choices list
  • the external→internal mapping (ext2int)

This prevents excluded names from reappearing via alias mappings.


Inheritance and merging

Because schema-level print_attribute is inherited and merged, treat it like a small “policy” object:

  • A base schema can enable print-attribute broadly.
  • A derived schema can:
    • add aliases
    • extend choices
    • add exclusions (in auto mode)
    • or disable print-attribute entirely (false)

The resolver and normalizer together ensure that:

  • the CLI only ever exposes external spellings
  • printing uses render_mapping({external_name: value})
  • and mapping is deterministic.

How print-attribute output looks

When --print-attribute is used, SchemaCommandLine prints each requested attribute in YAML style:

energy: -12.345
volume: 60.0

Internally, it prints one mapping per attribute, so output remains stream-friendly.

If the external name cannot be resolved to an attribute on the object, a DSUserError is raised with details (including internal name).


Tab completion with argcomplete

If argcomplete is installed, SchemaCommandLine enables completion via:

argcomplete.autocomplete(parser)

For --print-attribute, completion candidates are the computed external choices.

The canonical helper for computing them is:

  • compute_print_attribute_choices(...)

Downstream tools (such as Principia Materia) should call this helper rather than duplicating the logic.


Summary

SchemaCommandLine provides a practical, batteries-included way to build CLIs from schemas:

  • schema → argparse options (+ help/aliases/choices)
  • strict parsing + helpful suggestions
  • merge YAML + required_file + CLI with clear precedence
  • validate/type using SchemaProjector
  • optionally instantiate an object
  • optionally print introspected attributes using a carefully designed external/internal mapping system

The print-attribute feature is intentionally rich because it lives at the boundary between schema naming, object naming, and user-visible CLI affordances.

3 - Reference

Sphinx-generated documentation from the package docstrings is inserted here.