SchemaProjector
SchemaProjector is the core validation and typing engine of DataSchemer.
It takes:
- a schema definition (a dictionary)
- raw input data (a dictionary, often containing strings or loosely typed values)
and produces validated, typed Python data.
Validation happens during construction: if anything is missing, unknown, or invalid, SchemaProjector(...) raises a DataSchemer user-facing exception.
This page documents the behavior that exists in the current implementation.
Basic example
from data_schemer.schema_projector import SchemaProjector
schema = {
    "variables": {
        "a": {"type": "int", "optional": False},
        "b": {"type": "float", "optional": False},
    }
}
raw = {"a": "1", "b": "2.5"}
p = SchemaProjector(schema, raw)
assert p.data["a"] == 1
assert p.data["b"] == 2.5
Validation model
Required vs optional
In DataSchemer, variables are required by default.
- Required variables must be present in the input.
- Optional variables are explicitly marked with optional: true.
- A default may only be specified for optional variables. (If a required variable defines a default, schema loading raises an error.)
Defaults are applied before validation:
- If an optional variable has a default and the user did not provide a value, the default is injected into raw_data.
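The default-injection step can be pictured with a small standalone sketch. It does not call DataSchemer; the helper name inject_defaults and the sample default value are hypothetical, and only the optional and default keys described above are assumed.

```python
def inject_defaults(variables, raw):
    """Return a copy of raw with defaults filled in for absent optional variables."""
    filled = dict(raw)
    for name, definition in variables.items():
        if name in filled:
            continue  # the user supplied a value, so the default is not used
        if definition.get("optional", False) and "default" in definition:
            filled[name] = definition["default"]
    return filled


variables = {
    "a": {"type": "int"},  # required: no default allowed
    "tol": {"type": "float", "optional": True, "default": "0.5"},  # illustrative default
}
raw_data = inject_defaults(variables, {"a": "1"})
```

Because the default lands in the raw mapping, it would then pass through the same coercion and checks as a user-supplied value.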
Unknown keys are rejected
SchemaProjector is strict: unknown input keys are an error.
If input_data contains a key that is not declared in the schema’s variables, construction fails with DSUnknownNameError (including close-match suggestions).
This strictness is intentional: it catches typos early and keeps schemas authoritative.
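As an illustration of strict key checking with close-match suggestions, here is a minimal sketch. How DataSchemer computes its suggestions is not stated on this page; the use of difflib and the helper name find_unknown_keys are assumptions.

```python
import difflib


def find_unknown_keys(declared, raw):
    """Map each undeclared input key to a list of similar declared names."""
    report = {}
    for key in raw:
        if key not in declared:
            # get_close_matches returns the best matches above the cutoff ratio
            report[key] = difflib.get_close_matches(key, declared, n=3, cutoff=0.6)
    return report


report = find_unknown_keys(
    ["mode", "tolerance"],
    {"mod": "fast", "tolerance": "1e-6"},
)
```

A non-empty report is the situation in which a strict projector would refuse the input rather than silently drop the misspelled key.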
Additional schema restrictions
SchemaProjector enforces:
- requires: a variable may require one or more other variables to be present
- conflicts: sets of variables where at most one may be defined
- choices: restrict allowed values (including support for string-list)
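The three restrictions can be sketched as plain checks over already-typed data. This is a simplified illustration, not DataSchemer's code: the function name check_restrictions is hypothetical, conflicts is passed as a separate list of groups, and the string-list handling of choices is left out.

```python
def check_restrictions(variables, conflicts, data):
    """Collect human-readable violations of requires, conflicts, and choices."""
    problems = []
    for name, value in data.items():
        definition = variables[name]
        # requires: every listed companion must also be present
        for needed in definition.get("requires", []):
            if needed not in data:
                problems.append(f"{name} requires {needed}")
        # choices: the value must be one of the allowed values
        choices = definition.get("choices")
        if choices is not None and value not in choices:
            problems.append(f"{name}={value!r} is not one of {choices}")
    # conflicts: at most one member of each group may be defined
    for group in conflicts:
        present = [name for name in group if name in data]
        if len(present) > 1:
            problems.append(f"conflicting variables: {', '.join(present)}")
    return problems


variables = {
    "mode": {"choices": ["fast", "accurate"]},
    "seed": {"requires": ["mode"]},
    "path": {},
    "text": {},
}
problems = check_restrictions(
    variables, [["path", "text"]], {"seed": 1, "path": "a", "text": "b"}
)
```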
Errors and what to catch
SchemaProjector raises user-facing DataSchemer exceptions from data_schemer.errors, including:
- DSMissingRequiredError: required variables were not provided
- DSUnknownNameError: schema name not found (in multi-schema form) or input contains unknown variable(s)
- DSInvalidChoiceError: an input value is not in a variable’s choices
- DSCoercionError: type coercion failed (includes variable name, raw value, expected type, and details)
- DSUserError: general user-facing schema errors (e.g., conflicts, requires)
When embedding DataSchemer, it is usually sufficient to catch DSUserError (the base class), unless you want custom handling per subtype.
Schema forms: single vs multi-schema
Single schema (no name)
If the schema dictionary contains a top-level variables key, it is treated as a single schema and given the default name "default" (unless you pass an explicit name).
schema = {
    "variables": {
        "x": {"type": "int", "optional": False},
    }
}
p = SchemaProjector(schema, {"x": "3"})
assert p.schema_name == "default"
Multiple schemas with inheritance
A schema dictionary may contain multiple named schemas. A derived schema may list bases in inherit.
schema = {
    "base": {
        "variables": {
            "a": {"type": "int", "optional": False},
        }
    },
    "derived": {
        "inherit": ["base"],
        "variables": {
            "mode": {
                "type": "string",
                "optional": False,
                "choices": ["fast", "accurate"]
            }
        }
    }
}
p = SchemaProjector(schema, {"a": "2", "mode": "fast"}, "derived")
assert p.data["a"] == 2
assert p.data["mode"] == "fast"
Inheritance semantics (current behavior)
- Base schemas are applied first; child schema variables override base variables of the same name.
- Inheritance is recursive: bases may themselves inherit from other schemas.
- If schema_definitions contains multiple schemas, schema_name is required; omitting it raises DSUserError.
- If schema_name does not exist, DSUnknownNameError is raised with suggested schema names.
Note: the implementation treats inherit as an ordered list but does not explicitly define a conflict rule when multiple bases define the same variable name. In practice, the recursive update order determines the winner. If you rely on multiple inheritance, it is worth standardizing and documenting a precedence rule (or disallowing ambiguous overlaps).
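To make "the recursive update order determines the winner" concrete, the sketch below shows one plausible merge order: bases in listed order, then the schema's own variables. The function name resolve_variables is hypothetical, and the resulting "later base wins" outcome is a property of this sketch, not a documented DataSchemer rule. Cycle detection is omitted.

```python
def resolve_variables(schemas, name):
    """Merge variables for `name`: bases first (in listed order), then its own."""
    merged = {}
    for base in schemas[name].get("inherit", []):
        # recursion lets bases inherit from other schemas in turn
        merged.update(resolve_variables(schemas, base))
    merged.update(schemas[name].get("variables", {}))  # child overrides bases
    return merged


schemas = {
    "left": {"variables": {"x": {"type": "int"}}},
    "right": {"variables": {"x": {"type": "float"}, "y": {"type": "string"}}},
    "child": {"inherit": ["left", "right"], "variables": {"y": {"type": "int"}}},
}
resolved = resolve_variables(schemas, "child")
```

With this order, x comes from right (the later base) and y from child. Swapping the order of the inherit list changes the type of x, which is exactly the ambiguity the note warns about.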
Variable definitions
A schema’s variables mapping associates each variable name with a definition dictionary.
Common keys include:
- type (required; see below)
- optional (default: False)
- default (optional variables only)
- choices (list of allowed values)
- requires (list of required companion variables)
- code_alias (rename when constructing objects; see below)
- help (description of the variable that may be used in documentation)
- metavar (string passed to documentation)
Variable copy / update / delete / conflict (advanced)
In multi-schema definitions, the current implementation supports schema-level variable transforms:
- copy_variables: define a new variable by copying an existing variable from another schema using "schema@var" syntax
- update_variables: patch selected fields of existing variable definitions
- delete_variables: remove variables from the final merged set
- conflicts: sets of variables where at most one may be defined
These features are powerful, but they also introduce complexity. If you expect users to rely on them, add a short dedicated section with one concrete example for each.
Supported types
Built-in types are defined in SchemaProjector._type_definitions. The most commonly used are:
Scalars:
- int
- float
- fraction (fractions.Fraction)
- bool (accepts True/False or "true"/"false", case-insensitive)
- string
- none (identity)
Structured:
- string-list (string → [string], list → list of strings)
- dict-string-to-float-matrix
- dict-string-to-float-vector
Numeric arrays:
- float-vector, float-matrix (support expressions + substitutions)
- int-vector, int-matrix
- fraction-vector, fraction-matrix
- string-vector, string-matrix
Vector/matrix shaping rules
Array-like types are coerced into NumPy arrays and normalized to shape:
- vectors: always 1D (reshape(-1))
- matrices:
  - scalar → (1, 1)
  - 1D → (1, N)
  - 2D → unchanged
  - higher-D → flattened into rows (reshape(-1, last_dim))
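The shaping rules above can be restated as a mapping from input shape to output shape. The sketch below works on shape tuples only, so it runs without NumPy; the function names vector_shape and matrix_shape are hypothetical and are not part of DataSchemer.

```python
import math


def vector_shape(shape):
    """Target shape for vector types: always 1D, like reshape(-1)."""
    return (math.prod(shape),)


def matrix_shape(shape):
    """Target shape for matrix types, following the rules listed above."""
    if len(shape) == 0:  # scalar
        return (1, 1)
    if len(shape) == 1:  # 1D becomes a single row
        return (1, shape[0])
    if len(shape) == 2:  # 2D is left unchanged
        return tuple(shape)
    # higher-D: leading axes are flattened into rows, like reshape(-1, last_dim)
    return (math.prod(shape[:-1]), shape[-1])
```

For example, an input of shape (2, 3, 4) yields a (6, 4) matrix under these rules, and any input yields a vector whose length is the total element count.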
Arrays, substitutions, and named tokens
For float-vector and float-matrix, numeric coercion supports expressions and header substitutions (see numeric coercion docs).
In addition, SchemaProjector supports a small set of named array tokens for certain array types:
- I → identity matrix (3×3)
Tokens may be prefixed by an integer multiplier:
schema = {"variables": {"m": {"type": "float-matrix", "optional": False}}}
SchemaProjector(schema, {"m": "I"}).data["m"] # identity
SchemaProjector(schema, {"m": "2I"}).data["m"] # 2 × identity
Token matching is currently:
- only for specific array types (float-matrix, float-vector, int-matrix, int-vector, fraction-matrix, fraction-vector)
- only when the raw value is a string
- based on an optional leading integer multiplier followed by the token name (e.g. "12I")
If you plan to add more tokens (e.g. O for zeros, E for ones, diag(...), etc.), list them here as they become public API.
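The multiplier-plus-name matching described above can be sketched with a regular expression. This is an assumed reading of the rules, not DataSchemer's parser: the table ARRAY_BY_NAME, the pattern, and the function expand_token are hypothetical, and matrices are plain nested lists rather than NumPy arrays.

```python
import re

# Hypothetical token table: token name -> rows of the matrix it stands for.
ARRAY_BY_NAME = {"I": [[1, 0, 0], [0, 1, 0], [0, 0, 1]]}

# Optional unsigned integer multiplier followed by a token name, e.g. "12I".
TOKEN_PATTERN = re.compile(r"(\d*)([A-Za-z]+)")


def expand_token(raw):
    """Return the scaled token matrix, or None if raw is not a known token."""
    if not isinstance(raw, str):
        return None  # tokens are only recognised in string input
    match = TOKEN_PATTERN.fullmatch(raw.strip())
    if match is None or match.group(2) not in ARRAY_BY_NAME:
        return None
    factor = int(match.group(1)) if match.group(1) else 1
    rows = ARRAY_BY_NAME[match.group(2)]
    return [[factor * value for value in row] for row in rows]
```

Returning None for anything that is not a token lets the caller fall through to ordinary numeric coercion, so an input such as "1 0 0" is unaffected.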
Supplying additional choices at runtime
The constructor accepts an optional variable_choices mapping:
p = SchemaProjector(schema, raw, variable_choices={"mode": ["debug", "safe"]})
This extends (or adds) the schema choices for the specified variables.
If variable_choices references an unknown variable name, DSUnknownNameError is raised (with suggestions).
This is useful when a higher-level tool wants to allow extra modes without editing the schema file.
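One way to picture "extends (or adds)" is as a merge of the runtime list into each variable's existing choices. The helper extend_choices below is hypothetical, and KeyError stands in for the DSUnknownNameError that the real constructor raises.

```python
def extend_choices(variables, variable_choices):
    """Return copies of the variable definitions with extra choices merged in."""
    merged = {name: dict(definition) for name, definition in variables.items()}
    for name, extra in variable_choices.items():
        if name not in merged:
            raise KeyError(name)  # stand-in for DSUnknownNameError
        choices = list(merged[name].get("choices", []))
        for choice in extra:
            if choice not in choices:  # keep existing order, skip duplicates
                choices.append(choice)
        merged[name]["choices"] = choices
    return merged


variables = {"mode": {"type": "string", "choices": ["fast", "accurate"]}}
extended = extend_choices(variables, {"mode": ["debug", "fast"]})
```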
Accessing results
SchemaProjector provides:
- data: typed output (dict)
- raw_data: raw input after defaults are applied (dict)
- data_tuple: namedtuple view of data
- schema_variables: resolved variable definitions for the chosen schema (merged with inherited variables)
- schema_name: resolved schema name
- conflicts: merged conflict lists (including inherited conflicts)
String renderings (YAML-style):
- raw_data_string: render_mapping(self.raw_data)
- data_string: render_mapping(self.data)
Aliasing and object construction
A variable may define code_alias, which renames it when instantiating a target class:
class Foo:
    def __init__(self, x):
        self.x = x
schema = {
    "variables": {
        "a": {"type": "int", "optional": False, "code_alias": "x"},
    }
}
p = SchemaProjector(schema, {"a": "5"})
obj = p.get_instance(Foo)
assert obj.x == 5
How get_instance works (current behavior)
- The projector builds aliased_data: values are renamed according to code_alias where present.
- get_instance(cls) inspects cls.__init__ and passes only those keys that match constructor parameter names.
- Extra data keys are ignored for instantiation purposes (but unknown input keys are still rejected during projection).
This makes it easy to use the same schema both for validation and for building an object without manually wiring parameter names.
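The alias-then-filter behavior can be approximated in a few lines with inspect.signature. This is a sketch of the idea only; the function build_instance is hypothetical, and the real get_instance may handle cases (such as **kwargs constructors) that this version does not.

```python
import inspect


def build_instance(cls, variables, data):
    """Rename keys via code_alias, then pass only what cls.__init__ accepts."""
    aliased = {
        variables[name].get("code_alias", name): value
        for name, value in data.items()
    }
    parameters = inspect.signature(cls.__init__).parameters
    kwargs = {
        key: value
        for key, value in aliased.items()
        if key in parameters and key != "self"
    }
    return cls(**kwargs)


class Foo:
    def __init__(self, x):
        self.x = x


variables = {"a": {"type": "int", "code_alias": "x"}, "b": {"type": "int"}}
foo = build_instance(Foo, variables, {"a": 5, "b": 7})
```

Here b is valid projected data but is not a constructor parameter of Foo, so it is simply not passed.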
Extending domains
You can extend supported types and array tokens by creating a derived projector class:
Projector2 = SchemaProjector.domain_definitions(
type_definitions={"double": lambda x: int(x) * 2},
array_by_name={"Z": "0 0 0 ; 0 0 0 ; 0 0 0"},
)
schema = {"variables": {"a": {"type": "double", "optional": False}}}
p = Projector2(schema, {"a": "4"})
assert p.data["a"] == 8
domain_definitions(...) returns a subclass whose definitions extend the built-ins.
If a name collides, the new definition overrides the old one.
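The "subclass whose definitions extend the built-ins" pattern can be sketched with a toy class. MiniProjector and its coerce method are hypothetical stand-ins, not DataSchemer classes; only the idea of a class-level table that a derived class extends or overrides is taken from the text above.

```python
class MiniProjector:
    """Toy stand-in for a projector with a class-level type table."""

    _type_definitions = {"int": int, "float": float}

    @classmethod
    def domain_definitions(cls, type_definitions=None):
        """Return a subclass whose table is the parent's plus the additions."""
        # Later entries win, so a colliding name overrides the built-in.
        merged = {**cls._type_definitions, **(type_definitions or {})}
        return type(f"{cls.__name__}Extended", (cls,), {"_type_definitions": merged})

    @classmethod
    def coerce(cls, type_name, raw):
        """Apply the coercion function registered under type_name."""
        return cls._type_definitions[type_name](raw)


Extended = MiniProjector.domain_definitions({"double": lambda x: int(x) * 2})
```

Because the merged table lives on the new subclass, the parent class keeps its original definitions and other code using it is unaffected.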