SchemaProjector

SchemaProjector is the core validation and typing engine of DataSchemer.

It takes:

a schema definition (a dictionary)
raw input data (a dictionary, often containing strings or loosely typed values)

and produces validated, typed Python data.

Validation happens during construction: if anything is missing, unknown, or invalid, SchemaProjector(...) raises a user-facing DataSchemer exception.

This page documents the behavior that exists in the current implementation.

Basic example

from data_schemer.schema_projector import SchemaProjector

schema = {
  "variables": {
    "a": {"type": "int", "optional": False},
    "b": {"type": "float", "optional": False},
  }
}

raw = {"a": "1", "b": "2.5"}
p = SchemaProjector(schema, raw)

assert p.data["a"] == 1
assert p.data["b"] == 2.5

Validation model

Required vs optional

In DataSchemer, variables are required by default.

Required variables must be present in the input.
Optional variables are explicitly marked with optional: true.
A default may only be specified for optional variables. (If a required variable defines a default, schema loading raises an error.)

Defaults are applied before validation:

If an optional variable has a default and the user did not provide a value, the default is injected into raw_data.
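
The default-injection step can be pictured with a minimal sketch, assuming a plain dictionary schema shaped like the examples on this page. The helper name apply_defaults is hypothetical and not part of the DataSchemer API; it only illustrates that missing optional values are filled in before any validation runs.

```python
def apply_defaults(variables, raw):
    """Fill in defaults for optional variables the user did not provide.

    Hypothetical helper for illustration only; not part of DataSchemer.
    """
    filled = dict(raw)  # work on a copy so the caller's input is untouched
    for name, definition in variables.items():
        if name in filled:
            continue  # a user-supplied value always wins over the default
        if definition.get("optional", False) and "default" in definition:
            filled[name] = definition["default"]
    return filled


variables = {
    "size": {"type": "int", "optional": False},
    "steps": {"type": "int", "optional": True, "default": 10},
    "label": {"type": "string", "optional": True},  # optional, no default
}

# "steps" is injected; "label" stays absent because it has no default.
filled = apply_defaults(variables, {"size": "3"})
```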

Unknown keys are rejected

SchemaProjector is strict: unknown input keys are an error.

If input_data contains a key that is not declared in the schema’s variables, construction fails with DSUnknownNameError (including close-match suggestions).

This strictness is intentional: it catches typos early and keeps schemas authoritative.
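
As a rough illustration of strict key checking with close-match suggestions, the sketch below uses difflib from the Python standard library. Whether DataSchemer uses difflib internally is an assumption, and find_unknown_keys is a hypothetical helper, not a library function.

```python
import difflib


def find_unknown_keys(declared, raw):
    """Map each undeclared input key to a list of similar declared names.

    Hypothetical helper; the real projector raises DSUnknownNameError instead.
    """
    declared = list(declared)
    unknown = {}
    for key in raw:
        if key not in declared:
            unknown[key] = difflib.get_close_matches(key, declared, n=3)
    return unknown


unknown = find_unknown_keys(
    ["temperature", "pressure", "volume"],
    {"temperatur": "300", "pressure": "1.0"},  # first key is a typo
)
```

Here only the misspelled key is reported, together with the declared name it most resembles.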

Additional schema restrictions

SchemaProjector enforces:

requires: a variable may require one or more other variables to be present
conflicts: sets of variables where at most one may be defined
choices: restrict allowed values (including support for string-list)
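
The three restrictions can be sketched as a single pass over already-typed data. This is a simplified illustration under assumptions: check_restrictions is a hypothetical name, conflicts are passed as a list of name groups, and problems are collected as strings, whereas the real projector raises DataSchemer exceptions.

```python
def check_restrictions(variables, conflicts, data):
    """Return a list of human-readable problems; an empty list means the data passes."""
    problems = []
    for name, definition in variables.items():
        if name not in data:
            continue  # restrictions only apply to variables that are present
        for needed in definition.get("requires", []):
            if needed not in data:
                problems.append(f"{name} requires {needed}")
        choices = definition.get("choices")
        if choices is not None and data[name] not in choices:
            problems.append(f"{name} must be one of {choices}")
    for group in conflicts:
        present = [name for name in group if name in data]
        if len(present) > 1:  # at most one variable per conflict group
            problems.append(f"conflicting variables: {', '.join(present)}")
    return problems


variables = {
    "mode": {"type": "string", "choices": ["fast", "accurate"]},
    "seed": {"type": "int", "requires": ["mode"]},
    "file": {"type": "string"},
    "url": {"type": "string"},
}
conflicts = [["file", "url"]]

problems = check_restrictions(variables, conflicts, {"seed": 1, "file": "a", "url": "b"})
```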

Errors and what to catch

SchemaProjector raises user-facing DataSchemer exceptions from data_schemer.errors, including:

DSMissingRequiredError — required variables were not provided
DSUnknownNameError — schema name not found (in multi-schema form) or input contains unknown variable(s)
DSInvalidChoiceError — an input value is not in a variable’s choices
DSCoercionError — type coercion failed (includes variable name, raw value, expected type, and details)
DSUserError — general user-facing schema errors (e.g., conflicts, requires)

When embedding DataSchemer, it is usually sufficient to catch DSUserError (the base class), unless you want custom handling per subtype.
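
The catch-the-base-class pattern might look like the sketch below. The two exception classes are stand-ins that only mirror the names listed above so the example runs on its own; real code would import them from data_schemer.errors. The project function is likewise a hypothetical placeholder for a SchemaProjector(...) call.

```python
# Stand-in classes for illustration; import the real ones from data_schemer.errors.
class DSUserError(Exception):
    pass


class DSCoercionError(DSUserError):
    pass


def project(raw):
    """Hypothetical placeholder for a SchemaProjector(...) call that may fail."""
    try:
        return {"a": int(raw["a"])}
    except ValueError as exc:
        raise DSCoercionError(f"a: cannot coerce {raw['a']!r} to int") from exc


def load_or_report(raw):
    """Handle one subtype specially, then fall back to the base class."""
    try:
        return project(raw), None
    except DSCoercionError as exc:
        return None, f"bad value: {exc}"
    except DSUserError as exc:
        return None, f"invalid input: {exc}"


ok = load_or_report({"a": "1"})
failed = load_or_report({"a": "x"})
```

The subtype handler must come before the base-class handler, otherwise the base class would catch everything first.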

Schema forms: single vs multi-schema

Single schema (no name)

If the schema dictionary contains a top-level variables key, it is treated as a single schema and given the default name "default" (unless you pass an explicit name).

schema = {
  "variables": {
    "x": {"type": "int", "optional": False},
  }
}

p = SchemaProjector(schema, {"x": "3"})
assert p.schema_name == "default"

Multiple schemas with inheritance

A schema dictionary may contain multiple named schemas. A derived schema may list bases in inherit.

schema = {
  "base": {
    "variables": {
      "a": {"type": "int", "optional": False},
    }
  },
  "derived": {
    "inherit": ["base"],
    "variables": {
      "mode": {
        "type": "string",
        "optional": False,
        "choices": ["fast", "accurate"]
      }
    }
  }
}

p = SchemaProjector(schema, {"a": "2", "mode": "fast"}, schema_name="derived")
assert p.data["a"] == 2
assert p.data["mode"] == "fast"

Inheritance semantics (current behavior)

Base schemas are applied first; variables in the derived schema override base variables of the same name.
Inheritance is recursive: bases may themselves inherit from other schemas.
If schema_definitions contains multiple schemas, schema_name is required; omitting it raises DSUserError.
If schema_name does not exist, DSUnknownNameError is raised with suggested schema names.

Note: the implementation treats inherit as an ordered list but does not explicitly define a conflict rule when multiple bases define the same variable name. In practice, the recursive update order determines the winner. If you rely on multiple inheritance, it is worth standardizing and documenting a precedence rule (or disallowing ambiguous overlaps).
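
A minimal sketch of recursive resolution, assuming the merge is a plain dictionary update in inherit order (so a later base overwrites an earlier one, and the schema's own variables win last). The function resolve_variables is hypothetical, does not guard against inheritance cycles, and may differ from the actual implementation.

```python
def resolve_variables(schemas, name):
    """Merge the variables of schema `name`: bases first, then the schema itself."""
    definition = schemas[name]
    merged = {}
    for base in definition.get("inherit", []):
        # Assumption: later bases overwrite earlier ones on a name clash.
        merged.update(resolve_variables(schemas, base))
    merged.update(definition.get("variables", {}))  # own variables win last
    return merged


schemas = {
    "grand": {"variables": {"a": {"type": "int"}, "b": {"type": "int"}}},
    "base": {"inherit": ["grand"], "variables": {"b": {"type": "float"}}},
    "derived": {"inherit": ["base"], "variables": {"c": {"type": "string"}}},
}

resolved = resolve_variables(schemas, "derived")
# "a" comes from "grand", "b" from "base" (overriding "grand"), "c" from "derived"
```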

Child-schemas (nested schemas)

Child-schemas let a schema declare named child schemas that appear as nested mappings in input and are projected recursively.

Declaring child-schemas

At the schema level, declare child-schemas using child_schemas:

root:
  variables:
    x:
      type: int
      optional: true
  child_schemas: [alpha, beta]

alpha:
  variables:
    a:
      type: int
      optional: true

beta:
  variables:
    b:
      type: int
      optional: true

child_schemas may be a string (single name) or a list of strings.
Child-schemas accumulate through inheritance (bases first, then the derived schema), with duplicates removed while preserving first occurrence order.
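
The accumulation rule can be sketched as follows, assuming schemas are plain dictionaries. The helper collect_child_schemas is a hypothetical name used for illustration only.

```python
def collect_child_schemas(schemas, name):
    """Gather child-schema names through inheritance, bases first."""
    definition = schemas[name]
    names = []
    for base in definition.get("inherit", []):
        names.extend(collect_child_schemas(schemas, base))
    declared = definition.get("child_schemas", [])
    if isinstance(declared, str):
        declared = [declared]  # a single name may be given as a string
    names.extend(declared)
    # dict.fromkeys drops duplicates while keeping first-occurrence order
    return list(dict.fromkeys(names))


schemas = {
    "base": {"child_schemas": ["alpha", "beta"]},
    "root": {"inherit": ["base"], "child_schemas": ["gamma", "alpha"]},
    "solo": {"inherit": ["base"], "child_schemas": "beta"},
}

root_children = collect_child_schemas(schemas, "root")
```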

Input shape

Child-schema blocks are provided under their schema name and must be a mapping/dict:

# schema: the YAML definition above, loaded into a Python dict
raw = {
  "x": 1,
  "alpha": {"a": "2"},
  "beta": {"b": "3"},
}

p = SchemaProjector(schema, raw, schema_name="root")
assert p.data["x"] == 1
assert p.child_projectors["alpha"].data["a"] == 2
assert p.child_projectors["beta"].data["b"] == 3

If a child-schema key is present but its value is not a dict (e.g. a list or a string), construction fails with DSUserError.

If a child-schema key is present with value None, it is treated as an empty mapping.

Recursive child-schemas (grandchild-schemas)

Child-schemas may themselves declare child_schemas, creating arbitrary depth:

root:
  child_schemas: [alpha]

alpha:
  child_schemas: [alpha_one]

alpha_one:
  child_schemas: [alpha_leaf]

alpha_leaf:
  variables:
    leaf:
      type: int
      optional: true

Input mirrors the nesting:

# schema: the nested YAML definition above, loaded into a Python dict
raw = {
  "alpha": {
    "alpha_one": {
      "alpha_leaf": {"leaf": "7"}
    }
  }
}
p = SchemaProjector(schema, raw, schema_name="root")
assert p.data_tree["alpha"]["alpha_one"]["alpha_leaf"]["leaf"] == 7

How child-schemas are processed

During construction:

Raw input is copied.
Keys matching declared child_schemas are carved out into raw_child_data and removed from the current-level raw data.
The remaining keys are validated as normal variables (variables).
Each child-schema gets its own SchemaProjector, recursively, using its carved-out mapping as input.

This keeps unknown-key checking strict at every level.
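
The carve-out step described above can be sketched like this. The helper carve_children is hypothetical, and it raises TypeError as a stand-in where the real projector raises DSUserError.

```python
def carve_children(raw, child_names):
    """Split raw input into current-level data and per-child raw mappings."""
    remaining = dict(raw)  # copy, so the caller's input is not modified
    raw_child_data = {}
    for name in child_names:
        if name not in remaining:
            continue
        block = remaining.pop(name)  # remove from the current level
        if block is None:
            block = {}  # None is treated as an empty mapping
        if not isinstance(block, dict):
            # Stand-in for the DSUserError raised by the real projector.
            raise TypeError(f"child-schema {name!r} must be a mapping")
        raw_child_data[name] = block
    return remaining, raw_child_data


remaining, children = carve_children(
    {"x": 1, "alpha": {"a": "2"}, "beta": None},
    ["alpha", "beta"],
)
```

Because the child blocks are removed before the remaining keys are validated, a misspelled child name stays at the current level and is caught there as an unknown key.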

Accessing child-schema results

SchemaProjector exposes three related views:

child_projectors: {child_schema_name: SchemaProjector} (authoritative)
child_data: derived view {child_schema_name: projector.data}
data_tree: a recursive snapshot of typed data for the whole tree

Example:

p = SchemaProjector(schema, raw, schema_name="root")

# projector view (best when you need schema-level info)
alpha_p = p.child_projectors["alpha"]
print(alpha_p.schema_name, alpha_p.data)

# data view (typed dicts)
print(p.child_data["alpha"])

# full tree (typed nested structure)
print(p.data_tree)

Restrictions

To keep behavior well-defined:

A child-schema may not define required_file. If it does, schema loading raises DSUserError.
Every child-schema name in child_schemas must exist in schema_definitions, otherwise DSUnknownNameError is raised.

Searching for a schema name under a projector (advanced)

For tooling and introspection:

matches = p.find_projectors_with_paths("alpha_leaf")
# [(("alpha", "alpha_one", "alpha_leaf"), <SchemaProjector ...>)]

This returns all matches; it does not assume schema names are unique in the tree.
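
A path-collecting search of this kind can be sketched with a small recursive walk. The Node class below is a hypothetical stand-in for a projector (just a schema name plus named children), and the sketch assumes that only descendants are searched, which may not match the real method in every detail.

```python
class Node:
    """Hypothetical stand-in for a projector: a schema name plus named children."""

    def __init__(self, schema_name, children=None):
        self.schema_name = schema_name
        self.child_projectors = children or {}


def find_with_paths(node, target, path=()):
    """Return (path, node) pairs for every descendant whose schema name matches."""
    matches = []
    for name, child in node.child_projectors.items():
        child_path = path + (name,)
        if child.schema_name == target:
            matches.append((child_path, child))
        # keep descending: the same schema name may appear again deeper down
        matches.extend(find_with_paths(child, target, child_path))
    return matches


leaf = Node("alpha_leaf")
root = Node("root", {"alpha": Node("alpha", {"alpha_one": Node("alpha_one", {"alpha_leaf": leaf})})})

paths = [path for path, _ in find_with_paths(root, "alpha_leaf")]
```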

Variable definitions

A schema’s variables mapping associates each variable name with a definition dictionary. Common keys include:

type (required; see below)
optional (default: False)
default (optional variables only)
choices (list of allowed values)
requires (list of required companion variables)
code_alias (rename when constructing objects; see below)
help (description of the variable that may be used in documentation)
metavar (string passed to documentation)

Variable copy / update / delete / conflict (advanced)

In multi-schema definitions, the current implementation supports schema-level variable transforms:

copy_variables: define a new variable by copying an existing variable from another schema using "schema@var" syntax
update_variables: patch selected fields of existing variable definitions
delete_variables: remove variables from the final merged set
conflicts: sets of variables where at most one may be defined

These features are powerful, but they also introduce complexity. If you expect users to rely on them, add a short dedicated section with one concrete example for each.

Supported types

Built-in types are defined in SchemaProjector._type_definitions. The most commonly used are:

Scalars:

int
float
fraction (fractions.Fraction)
bool (accepts True/False or "true"/"false", case-insensitive)
string
none (identity)

Structured:

string-list (string → [string], list → list of strings)
dict-string-to-float-matrix
dict-string-to-float-vector

Array types:

float-vector, float-matrix (support expressions + substitutions)
int-vector, int-matrix
fraction-vector, fraction-matrix
string-vector, string-matrix

Vector/matrix shaping rules

Array-like types are coerced into NumPy arrays and normalized to shape:

vectors: always 1D (reshape(-1))
matrices:
- scalar → (1, 1)
- 1D → (1, N)
- 2D → unchanged
- higher-D → flattened into rows (reshape(-1, last_dim))
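
The shaping rules can be checked without NumPy by computing the resulting shape from the input shape alone. The function normalized_shape is a hypothetical helper that mirrors the rules listed above; it does not reshape any data.

```python
import math


def normalized_shape(shape, kind):
    """Return the shape an array of `shape` would have after normalization.

    `kind` is "vector" or "matrix"; `shape` is a tuple such as (2, 3).
    """
    shape = tuple(shape)
    if kind == "vector":
        return (math.prod(shape),)  # reshape(-1): always 1D
    if len(shape) == 0:
        return (1, 1)  # scalar
    if len(shape) == 1:
        return (1, shape[0])  # 1D becomes a single row
    if len(shape) == 2:
        return shape  # 2D is left unchanged
    # higher-D: reshape(-1, last_dim) flattens all leading axes into rows
    return (math.prod(shape[:-1]), shape[-1])


shapes = [normalized_shape(s, "matrix") for s in [(), (5,), (2, 3), (2, 3, 4)]]
```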

Arrays, substitutions, and named tokens

For float-vector and float-matrix, numeric coercion supports expressions and header substitutions (see numeric coercion docs).

In addition, SchemaProjector supports a small set of named array tokens for certain array types:

I → identity matrix (3×3)

Tokens may be prefixed by an integer multiplier:

schema = {"variables": {"m": {"type": "float-matrix", "optional": False}}}

SchemaProjector(schema, {"m": "I"}).data["m"]    # identity
SchemaProjector(schema, {"m": "2I"}).data["m"]   # 2 × identity

Token matching is currently:

only for specific array types (float-matrix, float-vector, int-matrix, int-vector, fraction-matrix, fraction-vector)
only when the raw value is a string
based on a leading integer multiplier + token name (e.g. "12I")
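
The matching rule above could be expressed with a regular expression, as in the sketch below. The pattern and the parse_token helper are assumptions for illustration; details such as whitespace handling or negative multipliers are not documented here and may differ in the real implementation.

```python
import re

# Optional leading digits, then a token name that starts with a letter.
TOKEN_PATTERN = re.compile(r"^(\d*)([A-Za-z_]\w*)$")


def parse_token(raw, known_tokens):
    """Return (multiplier, token_name), or None if raw is not a named token."""
    if not isinstance(raw, str):
        return None  # tokens are only recognized in string input
    match = TOKEN_PATTERN.match(raw.strip())
    if match is None or match.group(2) not in known_tokens:
        return None
    multiplier = int(match.group(1)) if match.group(1) else 1
    return multiplier, match.group(2)


known = {"I"}
parsed = [parse_token(value, known) for value in ["I", "2I", "12I", "1 0 0", 2.0]]
```

Values that do not match fall through to ordinary array coercion.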

If you plan to add more tokens (e.g. O for zeros, E for ones, diag(...), etc.), list them here as they become public API.

Supplying additional choices at runtime

The constructor accepts an optional variable_choices mapping:

p = SchemaProjector(schema, raw, variable_choices={"mode": ["debug", "safe"]})

This extends (or adds) the schema choices for the specified variables. If variable_choices references an unknown variable name, DSUnknownNameError is raised (with suggestions).

This is useful when a higher-level tool wants to allow extra modes without editing the schema file.
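
Conceptually, the extension behaves like the merge sketched below. The helper extend_choices is hypothetical, and it raises KeyError as a stand-in where the real projector raises DSUnknownNameError.

```python
def extend_choices(variables, variable_choices):
    """Return a copy of `variables` with extra choices appended per variable."""
    merged = {name: dict(definition) for name, definition in variables.items()}
    for name, extra in variable_choices.items():
        if name not in merged:
            # Stand-in for the DSUnknownNameError raised by the real projector.
            raise KeyError(name)
        current = list(merged[name].get("choices", []))
        # append only new values, keeping the schema's own order first
        merged[name]["choices"] = current + [c for c in extra if c not in current]
    return merged


variables = {"mode": {"type": "string", "choices": ["fast", "accurate"]}}
extended = extend_choices(variables, {"mode": ["debug", "fast"]})
```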

Accessing results

SchemaProjector provides:

data — typed output (dict)
raw_data — raw input after defaults are applied (dict)
data_tuple — namedtuple view of data
schema_variables — resolved variable definitions for the chosen schema (merged with inherited variables)
schema_name — resolved schema name
conflicts — merged conflict lists (including inherited conflicts)

String renderings (YAML-style):

raw_data_string — render_mapping(self.raw_data)
data_string — render_mapping(self.data)

Aliasing and object construction

A variable may define code_alias, which renames it when instantiating a target class:

class Foo:
  def __init__(self, x):
    self.x = x

schema = {
  "variables": {
    "a": {"type": "int", "optional": False, "code_alias": "x"},
  }
}

p = SchemaProjector(schema, {"a": "5"})
obj = p.get_instance(Foo)

assert obj.x == 5

How get_instance works (current behavior)

The projector builds aliased_data: values are renamed according to code_alias where present.
get_instance(cls) inspects cls.__init__ and passes only those keys that match constructor parameter names.
Extra data keys are ignored for instantiation purposes (but unknown input keys are still rejected during projection).

This makes it easy to use the same schema both for validation and for building an object without manually wiring parameter names.
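
The alias-then-filter behavior can be approximated with inspect.signature from the standard library. The function build_instance is a hypothetical sketch of the idea, not the body of get_instance.

```python
import inspect


def build_instance(cls, variables, data):
    """Rename keys via code_alias, then pass only what cls.__init__ accepts."""
    aliased = {
        variables.get(name, {}).get("code_alias", name): value
        for name, value in data.items()
    }
    params = inspect.signature(cls.__init__).parameters
    accepted = {key: value for key, value in aliased.items()
                if key in params and key != "self"}
    return cls(**accepted)


class Foo:
    def __init__(self, x):
        self.x = x


variables = {
    "a": {"type": "int", "code_alias": "x"},
    "note": {"type": "string"},  # not a constructor parameter, so it is dropped
}
obj = build_instance(Foo, variables, {"a": 5, "note": "ignored"})
```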

Extending domains

You can extend supported types and array tokens by creating a derived projector class:

Projector2 = SchemaProjector.domain_definitions(
  type_definitions={"double": lambda x: int(x) * 2},
  array_by_name={"Z": "0 0 0 ; 0 0 0 ; 0 0 0"},
)

schema = {"variables": {"a": {"type": "double", "optional": False}}}
p = Projector2(schema, {"a": "4"})
assert p.data["a"] == 8

domain_definitions(...) returns a subclass whose definitions extend the built-ins. If a name collides, the new definition overrides the old one.
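
One way such a factory can work is to merge the new definitions over the inherited ones and create a subclass with type(). The MiniProjector class below is a toy stand-in written for this sketch; it is an assumption about the general pattern, not the DataSchemer source.

```python
class MiniProjector:
    """Toy stand-in with a class-level table of type converters."""

    _type_definitions = {"int": int, "float": float}

    @classmethod
    def domain_definitions(cls, type_definitions=None):
        # New names extend the table; colliding names override the old entry.
        merged = {**cls._type_definitions, **(type_definitions or {})}
        return type(f"{cls.__name__}Extended", (cls,), {"_type_definitions": merged})

    def coerce(self, type_name, value):
        return self._type_definitions[type_name](value)


Extended = MiniProjector.domain_definitions({"double": lambda x: int(x) * 2})
doubled = Extended().coerce("double", "4")
```

Because the merged table lives on the subclass, the base class keeps its original definitions.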

Last modified January 30, 2026: changed from sub nomenclature to child. (d8fbcf3)