Data Management

How PM handles the data that it reads in and the data that it generates.

1: Data Aggregator
2: Database
3: hdf5

PM deals with varied data sets, some of which are read in from other sources and some of which are generated.

1 - Data Aggregator

Data Aggregator is a schema for defining input data, documenting the data, and characterizing relationships between input data.

Like many Python-based applications, PM needs to take input data from the command line and/or YAML input files. A mechanism is needed to characterize the data, enforce the data type of each input, and possibly do some preprocessing. Data Aggregator (DA) handles these tasks. DA uses descriptors to encode the properties of a collection of data.

Descriptors

Input data is characterized by descriptors. A descriptor describes a set of data associated with some entity, which is normally a class or a script. A descriptor is normally named after the class or command line interface it corresponds to. Below we document the general structure of a descriptor named descriptor_name.

descriptor_name:
  type: generic  # allowed values are generic, class, or script
  description: 
    A description of the collection of variables. If argparse 
    is used, this will be passed on.
  inherit:
    - descriptor1 # inherit properties of descriptor1
    - descriptor2 # inherit properties of descriptor2
  conflicts: 
    - [var1, var2]        # var1 and var2 cannot both be specified
    - [var1, var3, var4]  # var1, var3, var4 cannot all be specified
  variables:  
    var1:
      type: int           # a full list of data types is provided elsewhere
      optional: True      # var1 is not required; default is False
      dimension: scalar   # this is only for PM documentation 
      code_alias: v1      # sometimes an alias is used within code
      alias:
        - variable1       # multiple aliases can be given for argparse
      choices:            # var1 can only have values of 1 or 2 
        - 1
        - 2
      requires: 
        - var5      # if var1 is provided, var5 must be given too
      default: 1    # a default value may be provided
      help: 
        A description of the variable.
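
To make the conflicts and requires entries concrete, the following is a minimal sketch of how such rules could be checked against a set of supplied variables. The helper check_rules is hypothetical and is not part of DA; it only assumes the semantics stated in the comments above (a conflict group is violated when all of its members are supplied, and a requires entry is violated when the variable is supplied without its companion).

```python
def check_rules(descriptor, supplied):
    """Return a list of rule violations (empty if there are none).

    Hypothetical helper for illustration only; not DA's implementation.
    """
    given = set(supplied)
    problems = []
    # A conflict group is violated only when every member is supplied.
    for group in descriptor.get("conflicts", []):
        if all(name in given for name in group):
            problems.append("conflict: " + ", ".join(group))
    # A 'requires' entry is violated when the variable is supplied
    # but one of its required companions is missing.
    for name, props in descriptor.get("variables", {}).items():
        if name not in given:
            continue
        for needed in (props or {}).get("requires", []):
            if needed not in given:
                problems.append(f"{name} requires {needed}")
    return problems


descriptor = {
    "conflicts": [["var1", "var2"], ["var1", "var3", "var4"]],
    "variables": {
        "var1": {"type": "int", "requires": ["var5"]},
        "var2": {"type": "int"},
        "var5": {"type": "float"},
    },
}

print(check_rules(descriptor, {"var1": 1, "var5": 2.0}))
# []
print(check_rules(descriptor, {"var1": 1, "var2": 2}))
# ['conflict: var1, var2', 'var1 requires var5']
```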

Using DA to parse and enforce data types

A basic function of DA is to apply data types to input data. Consider a minimal example of a simple script, where some data in YAML format is read in:

import yaml
from principia_materia.io_interface.data_aggregator import DataAggregator

descriptor = yaml.safe_load("""
script_input:
  description:
    An example of a descriptor.
  variables:
    lattice_vectors:
      type: float-array
    scale:
      type: float
      help: a factor used to rescale the lattice vectors.
""")

data = yaml.safe_load("""
lattice_vectors: |
  0 1 1 
  1 0 1
  1 1 0
scale: 4.8
""")

data_ag = DataAggregator(
    'script_input',
    descriptor_definition=descriptor,
    input_data=data)

print(data_ag.data)
# {'lattice_vectors': array([[0., 1., 1.],
#        [1., 0., 1.],
#        [1., 1., 0.]]), 'scale': 4.8}

print(data_ag.data_tuple)
# script_input(lattice_vectors=array([[0., 1., 1.],
#        [1., 0., 1.],
#        [1., 1., 0.]]), scale=4.8)

DA uses the parse_array function to parse strings into arrays, which is documented elsewhere.
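
As a rough illustration of what string-to-array parsing involves, here is a minimal sketch in plain Python. It is not PM's parse_array: the function parse_rows is hypothetical, it returns nested lists rather than arrays, and it assumes (based only on the examples in this document) that rows are separated by newlines or commas and entries by whitespace.

```python
def parse_rows(text, cast=float):
    """Split text into rows on newlines or commas and into entries on whitespace.

    Hypothetical stand-in for illustration; PM's parse_array may behave differently.
    """
    rows = []
    for line in text.replace(",", "\n").splitlines():
        entries = line.split()
        if entries:  # skip blank rows
            rows.append([cast(entry) for entry in entries])
    return rows


from_yaml = parse_rows("0 1 1 \n1 0 1\n1 1 0\n")  # YAML block scalar form
from_cli = parse_rows("0 1 1 , 1 0 1 , 1 1 0")    # command line form

print(from_yaml)
# [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
print(from_yaml == from_cli)
# True
```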

Using DA with argparse

DA descriptors contain all the information needed to construct an argparse parser, which can be used to read variables from the command line. We continue the previous example, but instead of supplying the data in the script, we obtain all variables from the command line. The call to DataAggregator no longer provides any data; instead it instructs DA to use argparse, which is the default behavior.

data_ag = DataAggregator(
                'script_input',
                descriptor_definition = descriptor,
                use_argparse=True)

The default of use_argparse is True, so this variable only needs to be specified if one does not want the overhead of instantiating argparse. We can then run the script, here named test_script.py, from the command line:

$ test_script.py --lattice-vectors '0 1 1 , 1 0 1 , 1 1 0' --scale 4.8

It should be emphasized that one can both provide input data and use argparse. By default argparse takes precedence, allowing one to easily override variables from the command line.

All of the data from the descriptor has been loaded into argparse, and issuing the help flag -h or --help will generate the following:

$ test_script.py -h
# usage: test_script.py [-h] [--lattice-vectors LATTICE_VECTORS] [--scale SCALE]
# 
# An example of a descriptor.
# 
# options:
#   -h, --help            show this help message and exit
#   --lattice-vectors LATTICE_VECTORS
#   --scale SCALE         a factor used to rescale the lattice vectors.
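
The mapping from a descriptor to an argparse parser can be sketched with the standard library alone. The function build_parser and the CONVERTERS table below are hypothetical and only illustrate the idea; DA's actual construction of the parser may differ. Array types are left as raw strings in this sketch.

```python
import argparse

# Hypothetical mapping from descriptor type names to converters.
# Types not listed here (such as float-array) are kept as raw strings.
CONVERTERS = {"int": int, "float": float, "string": str}


def build_parser(descriptor):
    """Build an argparse parser from a descriptor dictionary (illustrative only)."""
    parser = argparse.ArgumentParser(description=descriptor.get("description"))
    for name, props in descriptor["variables"].items():
        # lattice_vectors becomes --lattice-vectors; aliases become extra flags.
        flags = ["--" + name.replace("_", "-")]
        flags += ["--" + alias.replace("_", "-") for alias in props.get("alias", [])]
        parser.add_argument(
            *flags,
            dest=name,
            type=CONVERTERS.get(props.get("type"), str),
            choices=props.get("choices"),
            default=props.get("default"),
            help=props.get("help"),
        )
    return parser


descriptor = {
    "description": "An example of a descriptor.",
    "variables": {
        "lattice_vectors": {"type": "float-array"},
        "scale": {
            "type": "float",
            "help": "a factor used to rescale the lattice vectors.",
        },
    },
}

parser = build_parser(descriptor)
args = parser.parse_args(
    ["--lattice-vectors", "0 1 1 , 1 0 1 , 1 1 0", "--scale", "4.8"]
)
print(args.scale)            # 4.8
print(args.lattice_vectors)  # 0 1 1 , 1 0 1 , 1 1 0
```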

Using DA with input files

We have now considered taking input directly from a YAML string and from the command line via argparse. Since a common use case is to take input from YAML files, DA allows one to specify a variable that holds a list of YAML file names to be processed. Consider breaking our input into two YAML files:

# file1.yaml 
lattice_vectors: |
  0 1 1 
  1 0 1
  1 1 0

# file2.yaml
scale: 4.8

Now we can tell DA which variable will contain the input file names using the input_files_tag='input_files' argument. Additionally, we need to add this variable to our descriptor:

descriptor['script_input']['variables']['input_files']={
  'optional':True,
  'type': 'string-list'
  }

data_ag = DataAggregator(
                'script_input',
                descriptor_definition = descriptor,
                input_files_tag = 'input_files',
                use_argparse = True)

We can then call the script:

$ test_script.py --input-files file1.yaml,file2.yaml

By default, we have input_files_tag = 'input_files' and use_argparse = True, so these variables do not need to be specified.

Using DA with a required input file

In addition to specifying input files as described above, it is also possible to specify a required input file, which demands that an input file be given on the command line or piped into standard input. This is specified in the constructor using require_file=True. This is desirable for a script that requires enough input data that an input file will always be used, allowing the user to specify the file name without using any flags. For example, we could have

# file1.yaml 
lattice_vectors: |
  0 1 1 
  1 0 1
  1 1 0
scale: 4.8

and then any of the following three commands could be used:

$ test_script.py   file1.yaml
$ test_script.py < file1.yaml
$ cat file1.yaml | test_script.py
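
The behavior of accepting either a file name or standard input can be sketched with argparse from the standard library. This is only an illustration of the idiom under stated assumptions, not DA's implementation; the function read_required_input and its stdin parameter are hypothetical.

```python
import argparse
import io
import sys


def read_required_input(argv, stdin=None):
    """Return the text of the file named in argv, or of stdin if no file is named.

    Hypothetical helper for illustration only.
    """
    stdin = sys.stdin if stdin is None else stdin
    parser = argparse.ArgumentParser()
    # nargs='?' makes the positional optional so that stdin can stand in for it.
    parser.add_argument("input_file", nargs="?")
    args = parser.parse_args(argv)
    if args.input_file is not None:
        with open(args.input_file) as handle:
            return handle.read()
    if stdin.isatty():
        # Nothing was piped or redirected, so there is no input to read.
        parser.error("an input file must be given as an argument or on standard input")
    return stdin.read()


# Simulate `test_script.py < file1.yaml` with an in-memory stream.
text = read_required_input([], stdin=io.StringIO("scale: 4.8\n"))
print(text.strip())  # scale: 4.8
```

In a real script one would call read_required_input(sys.argv[1:]) and pass the returned text to a YAML loader.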

Setting data hierarchy in DA

It should be emphasized that one can simultaneously specify variables through the constructor, the command line, input files, and a required input file, allowing data to be taken from four different sources. The precedence of each source of data is specified by the hierarchy=[0,1,2,3] variable in the constructor, where each number in the list corresponds to one of the following data sources:

data_sources = [ input_file_data,    # data from input yaml files
                 required_file_data, # data from a required file
                 args_data,          # data from command line
                 input_data ]        # data supplied through constructor

The hierarchy variable starts with lowest priority and ends with highest priority, which means that the default behavior is that regular input files have lowest priority and data given in the constructor has the highest priority.
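
The effect of the hierarchy list can be sketched as a simple ordered merge of dictionaries, where each source overrides the ones that come before it. The function merge_sources is hypothetical and the values are made up for illustration; it assumes that a source only overrides variables it actually contains.

```python
def merge_sources(data_sources, hierarchy=(0, 1, 2, 3)):
    """Merge dictionaries so that sources later in `hierarchy` override earlier ones.

    Hypothetical helper for illustration only; not DA's implementation.
    """
    merged = {}
    for index in hierarchy:  # lowest priority first, highest priority last
        merged.update(data_sources[index])
    return merged


input_file_data = {"scale": 1.0, "lattice_vectors": "0 1 1 , 1 0 1 , 1 1 0"}
required_file_data = {"scale": 2.0}
args_data = {"scale": 3.0}
input_data = {}  # nothing supplied through the constructor

data_sources = [input_file_data, required_file_data, args_data, input_data]

# Default order: the command line value wins because the constructor gave none.
print(merge_sources(data_sources)["scale"])                # 3.0
# Reversed order: regular input files now have the highest priority.
print(merge_sources(data_sources, [3, 2, 1, 0])["scale"])  # 1.0
```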

Pre-defined descriptors

DA will look for a file called descriptors.yaml in the same directory where data_aggregator.py is located. Descriptors have been created for many classes and command line interfaces in PM.

Instantiating classes using DA

DA can be used to instantiate a class using the aggregated data. Continuing with the above example, we could instantiate the Lattice class, where the constructor requires lattice_vectors.

from principia_materia.translation_group.lattice import Lattice
lattice = data_ag.get_instance(Lattice)

DA will pass all variables that Lattice allows, and withhold the rest.
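
One way such filtering could work is to inspect the constructor signature and pass only the matching variables. The sketch below uses the standard inspect module; the function instantiate and the class ToyLattice are hypothetical stand-ins and are not PM's get_instance or Lattice.

```python
import inspect


def instantiate(cls, data):
    """Call cls with only those entries of data that its constructor names.

    Hypothetical helper for illustration only.
    """
    params = inspect.signature(cls).parameters
    accepted = {key: value for key, value in data.items() if key in params}
    return cls(**accepted)


class ToyLattice:
    """Stand-in class whose constructor takes only lattice_vectors."""

    def __init__(self, lattice_vectors):
        self.lattice_vectors = lattice_vectors


data = {"lattice_vectors": [[0, 1, 1], [1, 0, 1], [1, 1, 0]], "scale": 4.8}

# 'scale' is withheld because ToyLattice does not accept it.
lattice = instantiate(ToyLattice, data)
print(lattice.lattice_vectors)
# [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
```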

2 - Database

PM uses an SQLite database to store information related to first-principles calculations.

3 - hdf5

PM uses HDF5 files to store irreducible derivatives, force tensors, etc.