SchemaProjector

SchemaProjector is the core validation and typing engine of DataSchemer.

It takes:

  • a schema definition (a dictionary)
  • raw input data (a dictionary, often containing strings or loosely typed values)

and produces validated, typed Python data.

Validation happens during construction: if anything is missing, unknown, or invalid, SchemaProjector(...) raises a DataSchemer user-facing exception.

This page documents the behavior that exists in the current implementation.


Basic example

from data_schemer.schema_projector import SchemaProjector

schema = {
  "variables": {
    "a": {"type": "int", "optional": False},
    "b": {"type": "float", "optional": False},
  }
}

raw = {"a": "1", "b": "2.5"}
p = SchemaProjector(schema, raw)

assert p.data["a"] == 1
assert p.data["b"] == 2.5

Validation model

Required vs optional

In DataSchemer, variables are required by default.

  • Required variables must be present in the input.
  • Optional variables are explicitly marked with optional: true.
  • A default may only be specified for optional variables. (If a required variable defines a default, schema loading raises an error.)

Defaults are applied before validation:

  • If an optional variable has a default and the user did not provide a value, the default is injected into raw_data.
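
The ordering above (defaults first, then the required check) can be sketched in plain Python. This is an illustration only, not the DataSchemer implementation: the function name apply_defaults_and_check_required is hypothetical, and ValueError and KeyError stand in for the library's own exceptions.

```python
def apply_defaults_and_check_required(variables, raw):
    """Return a copy of raw with defaults filled in (illustrative sketch)."""
    # Schema sanity check: a default is only legal on an optional variable.
    for name, spec in variables.items():
        if "default" in spec and not spec.get("optional", False):
            raise ValueError(f"required variable {name!r} may not define a default")

    filled = dict(raw)
    # Step 1: inject defaults for optional variables the user left out.
    for name, spec in variables.items():
        if name not in filled and "default" in spec:
            filled[name] = spec["default"]

    # Step 2: every non-optional variable must now be present.
    missing = [name for name, spec in variables.items()
               if not spec.get("optional", False) and name not in filled]
    if missing:
        raise KeyError("missing required variable(s): " + ", ".join(missing))
    return filled


variables = {
    "a": {"type": "int"},  # required by default
    "tol": {"type": "float", "optional": True, "default": "1e-6"},
    "label": {"type": "string", "optional": True},  # optional, no default
}
filled = apply_defaults_and_check_required(variables, {"a": "1"})
```

Here filled contains "a" and the injected "tol", while "label" stays absent because it has no default.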

Unknown keys are rejected

SchemaProjector is strict: unknown input keys are an error.

If input_data contains a key that is not declared in the schema’s variables, construction fails with DSUnknownNameError (including close-match suggestions).

This strictness is intentional: it catches typos early and keeps schemas authoritative.
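
A minimal sketch of how unknown keys with close-match suggestions can be detected, using difflib from the standard library. Whether DataSchemer uses difflib internally is an assumption; find_unknown_keys is a hypothetical helper written for this page.

```python
import difflib


def find_unknown_keys(declared, raw):
    """Map each undeclared input key to a list of similar declared names."""
    unknown = {}
    for key in raw:
        if key not in declared:
            # Up to three suggestions, best match first.
            unknown[key] = difflib.get_close_matches(key, list(declared), n=3)
    return unknown


declared = {"temperature": {}, "pressure": {}}
report = find_unknown_keys(declared, {"temperture": "300", "pressure": "1.0"})
```

A strict projector would raise as soon as the report is non-empty and include the suggestions in the error message.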

Additional schema restrictions

SchemaProjector enforces:

  • requires: a variable may require one or more other variables to be present
  • conflicts: sets of variables where at most one may be defined
  • choices: restrict allowed values (including support for string-list)
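
The three restrictions can be sketched as one pass over the data. This is a simplified illustration under stated assumptions: check_restrictions is a hypothetical helper, conflicts is passed as a separate list of groups, list values are checked against choices element by element, and the variable names are made up.

```python
def check_restrictions(variables, conflicts, data):
    """Return a list of violation messages; an empty list means data passes."""
    problems = []
    for name, spec in variables.items():
        if name not in data:
            continue
        # requires: every listed companion must also be present.
        for companion in spec.get("requires", []):
            if companion not in data:
                problems.append(f"{name} requires {companion}")
        # choices: a scalar is checked directly, a list element by element.
        choices = spec.get("choices")
        if choices is not None:
            value = data[name]
            items = value if isinstance(value, list) else [value]
            for item in items:
                if item not in choices:
                    problems.append(f"{name}: {item!r} is not an allowed choice")
    # conflicts: at most one variable of each group may be defined.
    for group in conflicts:
        present = [name for name in group if name in data]
        if len(present) > 1:
            problems.append("conflict between " + " and ".join(present))
    return problems


variables = {
    "mode": {"type": "string", "choices": ["fast", "accurate"]},
    "mesh": {"type": "int-vector", "optional": True, "requires": ["mode"]},
    "spacing": {"type": "float", "optional": True},
}
conflicts = [["mesh", "spacing"]]
```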

Errors and what to catch

SchemaProjector raises user-facing DataSchemer exceptions from data_schemer.errors, including:

  • DSMissingRequiredError — required variables were not provided
  • DSUnknownNameError — schema name not found (in multi-schema form) or input contains unknown variable(s)
  • DSInvalidChoiceError — an input value is not in a variable’s choices
  • DSCoercionError — type coercion failed (includes variable name, raw value, expected type, and details)
  • DSUserError — general user-facing schema errors (e.g., conflicts, requires)

When embedding DataSchemer, it is usually sufficient to catch DSUserError (the base class), unless you want custom handling per subtype.
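
The catch pattern looks roughly like the sketch below. To keep it self-contained, the exception classes are local stand-ins that mirror the hierarchy described above (the real ones live in data_schemer.errors), and project is a hypothetical substitute for calling SchemaProjector(schema, raw).

```python
# Stand-in hierarchy for illustration only.
class DSUserError(Exception):
    pass


class DSMissingRequiredError(DSUserError):
    pass


class DSCoercionError(DSUserError):
    pass


def project(raw):
    """Hypothetical stand-in for SchemaProjector(schema, raw).data."""
    if "a" not in raw:
        raise DSMissingRequiredError("missing required variable: a")
    try:
        return {"a": int(raw["a"])}
    except ValueError as exc:
        raise DSCoercionError(f"a: cannot convert {raw['a']!r} to int") from exc


def load(raw):
    """Return (data, error_message); exactly one of the two is None."""
    try:
        return project(raw), None
    except DSCoercionError as exc:
        # Subtype handlers must come before the base-class handler.
        return None, f"bad value: {exc}"
    except DSUserError as exc:
        # Catches every other user-facing error.
        return None, f"invalid input: {exc}"
```

If no per-subtype handling is needed, the single except DSUserError clause is enough.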


Schema forms: single vs multi-schema

Single schema (no name)

If the schema dictionary contains a top-level variables key, it is treated as a single schema and given the default name "default" (unless you pass an explicit name).

schema = {
  "variables": {
    "x": {"type": "int", "optional": False},
  }
}

p = SchemaProjector(schema, {"x": "3"})
assert p.schema_name == "default"

Multiple schemas with inheritance

A schema dictionary may contain multiple named schemas. A derived schema may list bases in inherit.

schema = {
  "base": {
    "variables": {
      "a": {"type": "int", "optional": False},
    }
  },
  "derived": {
    "inherit": ["base"],
    "variables": {
      "mode": {
        "type": "string",
        "optional": False,
        "choices": ["fast", "accurate"]
      }
    }
  }
}

p = SchemaProjector(schema, {"a": "2", "mode": "fast"}, "derived")
assert p.data["a"] == 2
assert p.data["mode"] == "fast"

Inheritance semantics (current behavior)

  • Base schemas are applied first; child schema variables override base variables of the same name.
  • Inheritance is recursive: bases may themselves inherit from other schemas.
  • If schema_definitions contains multiple schemas, schema_name is required; omitting it raises DSUserError.
  • If schema_name does not exist, DSUnknownNameError is raised with suggested schema names.

Note: the implementation treats inherit as an ordered list but does not explicitly define a conflict rule when multiple bases define the same variable name. In practice, the recursive update order determines the winner. If you rely on multiple inheritance, it is worth standardizing and documenting a precedence rule (or disallowing ambiguous overlaps).
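
The merge order can be illustrated with a small recursive resolver. This is a sketch, not the library's code: resolve_variables is a hypothetical name, and the rule that later entries in inherit win over earlier ones is an assumption of this sketch (it is what a plain left-to-right dict update produces).

```python
def resolve_variables(schemas, name, _seen=()):
    """Merge inherited variables base-first, then apply the schema's own."""
    if name in _seen:
        raise ValueError("inheritance cycle: " + " -> ".join(_seen + (name,)))
    schema = schemas[name]
    merged = {}
    # Bases are resolved recursively and applied in list order.
    for base in schema.get("inherit", []):
        merged.update(resolve_variables(schemas, base, _seen + (name,)))
    # The child's own variables override anything inherited.
    merged.update(schema.get("variables", {}))
    return merged


schemas = {
    "base": {"variables": {"a": {"type": "int"}, "mode": {"type": "string"}}},
    "mid": {"inherit": ["base"], "variables": {"b": {"type": "float"}}},
    "derived": {
        "inherit": ["mid"],
        "variables": {"mode": {"type": "string", "choices": ["fast", "accurate"]}},
    },
}
resolved = resolve_variables(schemas, "derived")
```

In this example "derived" ends up with a, b, and its own overriding definition of mode.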


Variable definitions

A schema’s variables mapping associates each variable name with a definition dictionary. Common keys include:

  • type (required; see below)
  • optional (default: False)
  • default (optional variables only)
  • choices (list of allowed values)
  • requires (list of required companion variables)
  • code_alias (rename when constructing objects; see below)
  • help (description of the variable; may be used in documentation)
  • metavar (string passed to documentation)

Variable copy / update / delete / conflict (advanced)

In multi-schema definitions, the current implementation supports schema-level variable transforms:

  • copy_variables: define a new variable by copying an existing variable from another schema using "schema@var" syntax
  • update_variables: patch selected fields of existing variable definitions
  • delete_variables: remove variables from the final merged set
  • conflicts: sets of variables where at most one may be defined

These features are powerful, but they also introduce complexity. If you expect users to rely on them, add a short dedicated section with one concrete example for each.


Supported types

Built-in types are defined in SchemaProjector._type_definitions. The most commonly used are:

Scalars:

  • int
  • float
  • fraction (fractions.Fraction)
  • bool (accepts True/False or "true"/"false", case-insensitive)
  • string
  • none (identity)

Structured:

  • string-list (string → [string], list → list of strings)
  • dict-string-to-float-matrix
  • dict-string-to-float-vector

Arrays:

  • float-vector, float-matrix (support expressions and substitutions)
  • int-vector, int-matrix
  • fraction-vector, fraction-matrix
  • string-vector, string-matrix
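
To make the bool, fraction, and string-list rules listed above concrete, here is a minimal sketch of what such coercers might look like. The function names are hypothetical, and treating a single string as a one-element list is this sketch's reading of "string → [string]".

```python
from fractions import Fraction


def to_bool(value):
    """Accept real booleans or the strings "true"/"false" in any case."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str) and value.lower() in ("true", "false"):
        return value.lower() == "true"
    raise ValueError(f"cannot interpret {value!r} as bool")


def to_string_list(value):
    """A lone string becomes a one-element list; a list is stringified."""
    if isinstance(value, str):
        return [value]
    return [str(item) for item in value]


three_quarters = Fraction("3/4")  # fractions.Fraction parses "n/d" strings
```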

Vector/matrix shaping rules

Array-like types are coerced into NumPy arrays and normalized to shape:

  • vectors: always 1D (reshape(-1))
  • matrices:
    • scalar → (1, 1)
    • 1D → (1, N)
    • 2D → unchanged
    • higher-D → flattened into rows (reshape(-1, last_dim))
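
The shaping rules can be checked with plain shape arithmetic. The sketch below works on shape tuples only (it does not touch NumPy) and computes the shape that reshape(-1) or reshape(-1, last_dim) would produce; vector_shape and matrix_shape are hypothetical helper names.

```python
import math


def vector_shape(shape):
    """Vectors are always flattened to 1D."""
    return (math.prod(shape),)


def matrix_shape(shape):
    """Normalize any input shape to a 2D matrix shape."""
    if len(shape) == 0:      # scalar
        return (1, 1)
    if len(shape) == 1:      # 1D becomes a single row
        return (1, shape[0])
    if len(shape) == 2:      # 2D is left unchanged
        return tuple(shape)
    # Higher-D: keep the last dimension, fold all others into rows.
    return (math.prod(shape[:-1]), shape[-1])
```

For example, an input of shape (2, 3, 4) becomes a (6, 4) matrix, and the same input as a vector has shape (24,).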

Arrays, substitutions, and named tokens

For float-vector and float-matrix, numeric coercion supports expressions and header substitutions (see numeric coercion docs).

In addition, SchemaProjector supports a small set of named array tokens for certain array types:

  • I → identity matrix (3×3)

Tokens may be prefixed by an integer multiplier:

schema = {"variables": {"m": {"type": "float-matrix", "optional": False}}}

SchemaProjector(schema, {"m": "I"}).data["m"]    # identity
SchemaProjector(schema, {"m": "2I"}).data["m"]   # 2 × identity

Token matching is currently:

  • only for specific array types (float-matrix, float-vector, int-matrix, int-vector, fraction-matrix, fraction-vector)
  • only when the raw value is a string
  • based on a leading integer multiplier + token name (e.g. "12I")
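
A minimal sketch of this matching rule, assuming a regular expression of the form "optional digits, then a token name". The pattern, the NAMED_ARRAYS table, and expand_token are hypothetical; nested lists are used instead of NumPy arrays to keep the example dependency-free.

```python
import re

# Optional integer multiplier followed by a token name, e.g. "I" or "12I".
TOKEN_PATTERN = re.compile(r"(\d*)([A-Za-z_]\w*)")

NAMED_ARRAYS = {
    "I": [[1, 0, 0], [0, 1, 0], [0, 0, 1]],  # 3x3 identity
}


def expand_token(raw):
    """Return the scaled named array, or None if raw is not a known token."""
    if not isinstance(raw, str):
        return None  # tokens are only recognized in string input
    match = TOKEN_PATTERN.fullmatch(raw.strip())
    if match is None or match.group(2) not in NAMED_ARRAYS:
        return None
    factor = int(match.group(1)) if match.group(1) else 1
    return [[factor * x for x in row] for row in NAMED_ARRAYS[match.group(2)]]
```

A value that returns None here would fall through to ordinary array coercion.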

If you plan to add more tokens (e.g. O for zeros, E for ones, diag(...)), list them here as they become public API.


Supplying additional choices at runtime

The constructor accepts an optional variable_choices mapping:

p = SchemaProjector(schema, raw, variable_choices={"mode": ["debug", "safe"]})

This extends (or adds) the schema choices for the specified variables. If variable_choices references an unknown variable name, DSUnknownNameError is raised (with suggestions).

This is useful when a higher-level tool wants to allow extra modes without editing the schema file.
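
The "extends (or adds)" behavior can be sketched as a merge over the resolved variable definitions. This is illustrative only: extend_choices is a hypothetical helper, KeyError stands in for DSUnknownNameError, and dropping duplicate entries is an assumption of this sketch.

```python
def extend_choices(variables, variable_choices):
    """Return a copy of variables with extra choices merged in."""
    merged = {name: dict(spec) for name, spec in variables.items()}
    for name, extra in variable_choices.items():
        if name not in merged:
            raise KeyError(f"unknown variable in variable_choices: {name}")
        existing = list(merged[name].get("choices", []))
        # Append new values, keeping the schema's own choices first.
        merged[name]["choices"] = existing + [c for c in extra if c not in existing]
    return merged


variables = {
    "mode": {"type": "string", "choices": ["fast", "accurate"]},
    "tag": {"type": "string", "optional": True},  # no choices in the schema
}
extended = extend_choices(variables, {"mode": ["debug", "fast"], "tag": ["x"]})
```

The original variables mapping is left untouched, so the schema itself stays authoritative.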


Accessing results

SchemaProjector provides:

  • data — typed output (dict)
  • raw_data — raw input after defaults are applied (dict)
  • data_tuple — namedtuple view of data
  • schema_variables — resolved variable definitions for the chosen schema (merged with inherited variables)
  • schema_name — resolved schema name
  • conflicts — merged conflict lists (including inherited conflicts)

String renderings (YAML-style):

  • raw_data_string → render_mapping(self.raw_data)
  • data_string → render_mapping(self.data)

Aliasing and object construction

A variable may define code_alias, which renames it when instantiating a target class:

class Foo:
  def __init__(self, x):
    self.x = x

schema = {
  "variables": {
    "a": {"type": "int", "optional": False, "code_alias": "x"},
  }
}

p = SchemaProjector(schema, {"a": "5"})
obj = p.get_instance(Foo)

assert obj.x == 5

How get_instance works (current behavior)

  • The projector builds aliased_data: values are renamed according to code_alias where present.
  • get_instance(cls) inspects cls.__init__ and passes only those keys that match constructor parameter names.
  • Extra data keys are ignored for instantiation purposes (but unknown input keys are still rejected during projection).

This makes it easy to use the same schema both for validation and for building an object without manually wiring parameter names.
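
The aliasing and parameter filtering described above can be sketched with inspect.signature from the standard library. build_instance is a hypothetical stand-in for get_instance, written under the assumption that filtering is done by parameter name only.

```python
import inspect


def build_instance(cls, variables, data):
    """Rename keys via code_alias, then pass only what __init__ accepts."""
    aliased = {
        variables[name].get("code_alias", name): value
        for name, value in data.items()
    }
    accepted = set(inspect.signature(cls.__init__).parameters) - {"self"}
    kwargs = {key: value for key, value in aliased.items() if key in accepted}
    return cls(**kwargs)


class Foo:
    def __init__(self, x, scale=1.0):
        self.x = x
        self.scale = scale


variables = {
    "a": {"type": "int", "code_alias": "x"},
    "note": {"type": "string", "optional": True},
}
obj = build_instance(Foo, variables, {"a": 5, "note": "not a Foo parameter"})
```

Here "a" reaches the constructor as x, "note" is dropped because Foo does not accept it, and scale keeps its default.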


Extending domains

You can extend supported types and array tokens by creating a derived projector class:

Projector2 = SchemaProjector.domain_definitions(
  type_definitions={"double": lambda x: int(x) * 2},
  array_by_name={"Z": "0 0 0 ; 0 0 0 ; 0 0 0"},
)

schema = {"variables": {"a": {"type": "double", "optional": False}}}
p = Projector2(schema, {"a": "4"})
assert p.data["a"] == 8

domain_definitions(...) returns a subclass whose definitions extend the built-ins. If a name collides, the new definition overrides the old one.
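
One way such a factory can be built is with a classmethod that creates a subclass carrying a merged definitions table. The sketch below uses a toy MiniProjector class, not SchemaProjector, and covers type definitions only; how the real implementation builds its subclass is not documented here.

```python
class MiniProjector:
    """Toy projector: maps each key through the coercer named in types."""

    _type_definitions = {"int": int, "float": float}

    def __init__(self, types, raw):
        self.data = {
            key: self._type_definitions[types[key]](value)
            for key, value in raw.items()
        }

    @classmethod
    def domain_definitions(cls, type_definitions=None):
        # New entries extend the built-ins; on a name clash the new one wins.
        merged = {**cls._type_definitions, **(type_definitions or {})}
        return type(cls.__name__ + "Extended", (cls,), {"_type_definitions": merged})


Extended = MiniProjector.domain_definitions({"double": lambda x: int(x) * 2})
p = Extended({"a": "double", "b": "float"}, {"a": "4", "b": "2.5"})
```

Because the merged table lives on the subclass, the parent class keeps its original definitions.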


Last modified January 20, 2026: updated schema projector (3cac3ed)