PM deals with varied data sets, some of which are read in from external sources and some of which are generated internally.
Data Management
- 1: Data Aggregator
- 2: Database
- 3: hdf5
1 - Data Aggregator
Like many Python-based applications, PM needs to take input data from the command line and/or YAML input files. A mechanism is needed to characterize the data, enforce the data types of the input, and possibly do some preprocessing. Data Aggregator (DA) handles these tasks. DA uses descriptors to encode the properties of a collection of data.
Descriptors
Input data is characterized by descriptors. A descriptor describes a set of data associated with some entity, normally a class or a script, and is usually named after the class or command-line interface it corresponds to. Below we document the general structure of a descriptor named descriptor_name.
descriptor_name:
  type: generic          # allowed values are generic, class, or script
  description:
    A description of the collection of variables. If argparse
    is used, this will be passed on.
  inherit:
    - descriptor1        # inherit properties of descriptor1
    - descriptor2        # inherit properties of descriptor2
  conflicts:
    - [var1, var2]         # var1 and var2 cannot both be specified
    - [var1, var3, var4]   # var1, var3, and var4 cannot all be specified
  variables:
    var1:
      type: int            # a full list of data types is provided elsewhere
      optional: True       # var1 is not required; default is False
      dimension: scalar    # this is only for PM documentation
      code_alias: v1       # sometimes an alias is used within code
      alias:
        - variable1        # multiple aliases can be given for argparse
      choices:             # var1 can only have values of 1 or 2
        - 1
        - 2
      requires:
        - var5             # if var1 is provided, var5 must be given too
      default: 5.2         # a default value may be provided
      help:
        A description of the variable.
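To make the structure concrete, here is a small hypothetical descriptor; every name in it (relax_input, structure_input, kpoint_mesh, kpoint_density) is invented for illustration. It inherits the variables of a structure_input descriptor and forbids specifying the k-point mesh in two ways at once:
relax_input:
  type: script
  description:
    Hypothetical inputs for a structural relaxation script.
  inherit:
    - structure_input                # pull in lattice/atom variables defined elsewhere
  conflicts:
    - [kpoint_mesh, kpoint_density]  # only one way of specifying k-points is allowed
  variables:
    kpoint_mesh:
      type: int-array
      optional: True
      help: an explicit k-point mesh.
    kpoint_density:
      type: float
      optional: True
      help: a target k-point density used to construct a mesh.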
Using DA to parse and enforce data types
A core piece of DA functionality is applying data types to input data. Consider a minimal example in a simple script, where some data is read in from a YAML string:
import yaml
from principia_materia.io_interface.data_aggregator import DataAggregator

descriptor = yaml.safe_load("""
script_input:
  description:
    An example of a descriptor.
  variables:
    lattice_vectors:
      type: float-array
    scale:
      type: float
      help: a factor used to rescale the lattice vectors.
""")

data = yaml.safe_load("""
lattice_vectors: |
  0 1 1
  1 0 1
  1 1 0
scale: 4.8
""")

data_ag = DataAggregator(
    'script_input',
    descriptor_definition=descriptor,
    input_data=data)

print(data_ag.data)
# {'lattice_vectors': array([[0., 1., 1.],
#                            [1., 0., 1.],
#                            [1., 1., 0.]]), 'scale': 4.8}

print(data_ag.data_tuple)
# script_input(lattice_vectors=array([[0., 1., 1.],
#              [1., 0., 1.],
#              [1., 1., 0.]]), scale=4.8)
DA uses the parse_array function to parse strings into arrays; the function is documented elsewhere.
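While the real parse_array is documented elsewhere, the string convention it handles (rows separated by newlines or commas, elements separated by whitespace) can be sketched with plain numpy; this stand-in is illustrative only and is not PM's implementation:
import numpy as np

def parse_array_sketch(text):
    # Illustrative stand-in for parse_array, not the real PM function:
    # rows are separated by newlines or commas, elements by whitespace.
    rows = text.replace(',', '\n').strip().splitlines()
    return np.array([[float(x) for x in row.split()] for row in rows])

parse_array_sketch('0 1 1 , 1 0 1 , 1 1 0')
# array([[0., 1., 1.],
#        [1., 0., 1.],
#        [1., 1., 0.]])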
Using DA with argparse
DA descriptors contain all the information needed to construct an argparse instance, which can be used to collect information from the command line. We continue the previous example, but now, instead of supplying the data directly, we will get all variables from the command line. The call to DataAggregator no longer provides any data; instead it instructs DA to use argparse, which is the default behavior.
data_ag = DataAggregator(
    'script_input',
    descriptor_definition=descriptor,
    use_argparse=True)
The default of use_argparse is True, so this argument only needs to be specified explicitly (as use_argparse=False) when one wants to avoid the overhead of instantiating argparse. We can then call the script, named test_script.py, from the command line:
$ test_script.py --lattice-vectors '0 1 1 , 1 0 1 , 1 1 0' --scale 4.8
It should be emphasized that one can both provide input data and use argparse; by default argparse takes precedence, allowing one to easily override variables from the command line.
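For example (a sketch consistent with the default precedence just described), a script could pass defaults through the constructor while still letting the user override them from the command line:
data_ag = DataAggregator(
    'script_input',
    descriptor_definition=descriptor,
    input_data=data,       # values supplied through the constructor
    use_argparse=True)     # command-line flags override them by default
$ test_script.py --scale 9.6
# scale is now 9.6; lattice_vectors keeps its constructor value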
All of the data from the descriptor has been loaded into argparse, and issuing the help flag -h or --help will generate the following:
$ test_script.py -h
# usage: test_script.py [-h] [--lattice-vectors LATTICE_VECTORS] [--scale SCALE]
#
# An example of a descriptor.
#
# options:
#   -h, --help            show this help message and exit
#   --lattice-vectors LATTICE_VECTORS
#   --scale SCALE         a factor used to rescale the lattice vectors.
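The mapping from descriptor fields to argparse arguments is straightforward; the following is an assumed sketch of the kind of translation DA performs internally, not PM's actual code:
import argparse

def build_parser_sketch(name, descriptor):
    # Hypothetical illustration of the descriptor -> argparse translation.
    definition = descriptor[name]
    parser = argparse.ArgumentParser(description=definition.get('description'))
    for var, props in definition['variables'].items():
        parser.add_argument(
            '--' + var.replace('_', '-'),   # lattice_vectors -> --lattice-vectors
            help=props.get('help'),
            default=props.get('default'),
            choices=props.get('choices'))
    return parser

parser = build_parser_sketch('script_input', descriptor)
args = parser.parse_args(['--scale', '4.8'])
# args.scale == '4.8'; the real DA would additionally apply the declared
# type (float) and honor optional, alias, requires, and conflicts.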
Using DA with input files
We have now considered taking input directly from a YAML string and taking input from the command line via argparse. Since a common use case is to take input from YAML files, DA allows one to designate a variable that holds a list of YAML file names to be processed. Consider breaking our input into two YAML files:
# file1.yaml
lattice_vectors: |
  0 1 1
  1 0 1
  1 1 0
# file2.yaml
scale: 4.8
Now we can tell DA which variable will contain the input file names using the input_files_tag='input_files' argument. Additionally, we need to add this variable to our descriptor:
descriptor['script_input']['variables']['input_files'] = {
    'optional': True,
    'type': 'string-list',
}

data_ag = DataAggregator(
    'script_input',
    descriptor_definition=descriptor,
    input_files_tag='input_files',
    use_argparse=True)
We can then call the script:
$ test_script.py --input-files file1.yaml,file2.yaml
By default, input_files_tag='input_files' and use_argparse=True, so these arguments do not need to be specified.
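Conceptually, processing the file list amounts to loading each file in order and merging the results; a minimal sketch, assuming later files override earlier ones on duplicate keys:
import yaml

def load_input_files_sketch(paths):
    # Illustrative only: merge YAML files in order, later files
    # winning on duplicate keys.
    merged = {}
    for path in paths:
        with open(path) as handle:
            merged.update(yaml.safe_load(handle))
    return merged

load_input_files_sketch(['file1.yaml', 'file2.yaml'])
# {'lattice_vectors': '0 1 1\n1 0 1\n1 1 0\n', 'scale': 4.8}
# (raw values; DA would still apply the declared types)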
Using DA with a required input file
In addition to specifying input files as described above, it is also possible to specify a required input file, which demands that an input file be given on the command line or piped into standard input. This is specified in the constructor using require_file=True. It is desirable for a script that requires enough input data that an input file will always be used, allowing the user to specify just the file name without any flags. For example, we could have
For example, we could have
# file1.yaml
lattice_vectors: |
0 1 1
1 0 1
1 1 0
scale: 4.8
and then any of the three commands could be used:
$ test_script.py file1.yaml
$ test_script.py < file1.yaml
$ cat file1.yaml | test_script.py
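Handling both a positional file name and piped standard input is a common pattern; a minimal sketch of how a script can support all three invocations above (illustrative only, not PM's implementation):
import sys
import yaml

def read_required_file_sketch():
    # Illustrative only: read YAML from a positional argument if given,
    # otherwise from piped standard input.
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            return yaml.safe_load(handle)
    if not sys.stdin.isatty():   # something was piped in
        return yaml.safe_load(sys.stdin)
    sys.exit('error: an input file is required')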
Setting data hierarchy in DA
It should be emphasized that one can simultaneously specify variables through the constructor, the command line, input files, and a required input file, allowing data to be taken from four different sources. The precedence of each data source is specified by the hierarchy=[0,1,2,3] argument of the constructor, where each number in the list indexes the following data sources:
data_sources = [input_file_data,     # data from input yaml files
                required_file_data,  # data from a required file
                args_data,           # data from command line
                input_data]          # data supplied through constructor
The hierarchy list is ordered from lowest to highest priority, so the default behavior gives regular input files the lowest priority and data supplied through the constructor the highest.
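The resulting behavior can be sketched as a sequence of dictionary updates applied from lowest to highest priority (an assumed illustration, not PM's actual code):
def resolve_data_sketch(data_sources, hierarchy=(0, 1, 2, 3)):
    # Illustrative only: each source overrides the previous ones
    # key by key; absent sources are treated as empty.
    resolved = {}
    for index in hierarchy:
        resolved.update(data_sources[index] or {})
    return resolved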
Pre-defined descriptors
DA will look for a file called descriptors.yaml in the same directory where data_aggregator.py is located. Descriptors have been created for many classes and command-line interfaces in PM.
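For instance, the path DA searches can be reconstructed from the module location (a sketch using only the import shown earlier; the loading itself is internal to DA):
import os
from principia_materia.io_interface import data_aggregator

descriptors_path = os.path.join(
    os.path.dirname(data_aggregator.__file__), 'descriptors.yaml')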