Reference¶
StructuredData in general¶
StructuredData is the concept of organizing data in a special hierarchical data structure. First we have to define the terms used in the following chapters.
StructuredData terminology¶
- StructuredData
This is the concept of having data in a hierarchical structure. There is always a top node which is always a collection.
- StructuredDataContainer
This is a StructuredData structure that contains a StructuredDataStore and StructuredDataTypes.
- StructuredDataStore
This is a StructuredData structure that holds your data.
- StructuredDataTypes
This is a StructuredData structure that contains type declarations for a StructuredDataStore.
- node
Either a collection (a map or a list) or a scalar.
- scalar
Either a boolean, integer, real or string. A scalar is a simple value with no references. It cannot be referenced and is always contained in a collection.
- boolean
This is a scalar with only two possible values, True or False. Note that in SDpyshell these two values are written True and False, with an upper case first letter. In YAML, however, the values are true and false (all lower case).
- integer
An integer number. Note that the range of these numbers is not defined here. We require that the range is at least -2**31 to +2**31.
- real
A floating point number. We require floating point numbers according to the IEEE 754 standard.
- string
A sequence of characters. Unicode characters are supported.
- map
A data structure that maps mapkeys, which are always strings, to values which are always nodes. Note that each mapkey can only be present once in the map. Each mapkey is associated with exactly one node. However, two map keys may be associated with the same node.
- mapkey
This is a key of a map. A mapkey is always a string.
- list
A data structure that is a sequence of nodes. Note that the elements of the list have the order you gave them and that two elements of the list may be equal.
- listindex
This is the index that identifies a member of a list. A listindex is always an integer.
- key
Either a mapkey or a listindex.
- keylist
A list of keys. A keylist is a reference to a node in a StructuredDataStore. It describes how to find the node when you start at the top of the StructuredDataStore. The first key identifies a node in the top collection. If this node is a collection, the second key identifies a node in this second collection. If this node is again a collection, the third key identifies a node in this third collection, and so on until you finally reach the referenced node.
- path
This is a keylist converted to a string. Basically mapkeys are concatenated with dots ‘.’ while listindices are concatenated after they are enclosed in square brackets. A typical path may look like this “abc.def[4].ghi”. For a precise definition of how paths are constructed see paths.
- pattern
A path that may also contain wildcards. By definition, all paths are also patterns.
- wildcard
Special keys that match whole classes of keys in a StructuredData structure. “*” matches any single mapkey or listindex, while “**” matches a sequence of one or more mapkeys and listindices.
- reference
Collections are never contained in other collections, they are only referenced. It is possible that a collection is referenced by more than one other collection.
- link
A link is a reference to a collection that is already referenced somewhere else.
References and links¶
One is tempted to see collections as containing other collections or scalars, but this is not true in the general case. Collections contain scalars, but they never actually contain other collections; they only have references to them. The difference between containing an item and just having a reference is that in a “contains” relationship a collection can only be contained in one other collection, whereas in a “reference” relationship a collection may be referenced by several other collections. “Reference” relationships even allow cycles, e.g. A referencing B referencing C referencing A.
However, if a collection is referenced by only one other collection, it makes no practical difference if we see this as a “contains” or a “reference” relationship.
If a collection is referenced by at least two other collections, we always have to see this as a “reference” relationship. In the context of StructuredData we call these cases “links” and distinguish them from the ordinary cases.
When querying a StructuredData object, links are not recognizable. However, if you apply a change to a collection, it makes a big difference whether this collection is referenced by only one or by more than one other collection. In order to make working with links easier, SDpyshell provides some format parameters and functions that help you to detect links and see which other collections reference a given collection.
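For readers familiar with python, the effect of a link can be illustrated with plain dicts and lists. The following is only a minimal sketch under that analogy and is not SDpyshell's own link detection; the helper name `find_links` is hypothetical. It walks a structure, records every keylist under which each collection is reached, and reports the collections that are reached more than once.

```python
def find_links(top):
    """Return {id(collection): [keylists]} for collections referenced more than once."""
    seen = {}  # id(collection) -> list of keylists leading to it

    def walk(node, keylist):
        if not isinstance(node, (dict, list)):
            return  # scalars are contained, never referenced
        keylists = seen.setdefault(id(node), [])
        keylists.append(keylist)
        if len(keylists) > 1:
            return  # already visited; this also stops endless loops on cycles
        items = node.items() if isinstance(node, dict) else enumerate(node)
        for key, child in items:
            walk(child, keylist + [key])

    walk(top, [])
    return {ident: kl for ident, kl in seen.items() if len(kl) > 1}


shared = ["x", "y"]
store = {"a": {"items": shared}, "b": {"items": shared}}

# Querying gives no hint that both entries are the same collection ...
print(store["a"]["items"], store["b"]["items"])

# ... but a change applied through one reference is visible through the other.
store["a"]["items"].append("z")
print(store["b"]["items"])

print(list(find_links(store).values()))
```

The sketch identifies collections by `id()`, which is only meaningful while the structure is alive and unchanged during the walk.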
Relation of Structured Data to python data structures¶
You may skip this section if you are not familiar with python.
Here is an overview of which terms of the StructuredData definition relate to which python data type:
Structured Data term | python data type |
---|---|
map | dict where keys are always strings |
list | list |
boolean | bool |
integer | int |
real | float |
string | str |
collection | either dict or list |
scalar | a bool, an int, a float or a str |
Paths¶
The definition of StructuredData allows us to construct a unique path for each node. We construct a path like this:
We start at the top of the StructuredDataStore and move, key by key, towards the node we have selected. We collect the keys we encounter, in that order, in a list. This list of keys identifies the node. A path is simply a string representation of that list of keys.
Joining a keylist to a path¶
The rules to construct a path from a list of keys are like this:
If the key is a list index, convert it to a string and enclose it in square brackets, e.g. index 9 becomes the string “[9]”.
If the key is a map key, it must be a string. Apply the escape rules to the string.
Combine all converted keys with the “.” character.
If the path contains the sequence “.[”, replace it with “[”.
Here are some examples:
list of keys | path |
---|---|
“A” “B” | A.B |
“A.B” “C” | A\.B.C |
“A” 2 “C” | A[2].C |
“A” “*” “C” | A\*.C |
“A” ANYKEY “C” | A.*.C |
Note that “ANYKEY” is a special variable that represents the “*” wildcard as it is used in patterns, for more information on patterns see patterns.
Escape rules¶
The escape rules ensure that any list of map keys and list indices can be represented as a path and that this list can always be reconstructed from the path. The rules also ensure that a path cannot be confused with a pattern containing wildcards.
The escape rules are these:
If the key is “*” change it to “\*”.
If the key is “**” change it to “\**”.
If the key is “#” change it to “\#”.
If the key starts with a sequence of “\” followed by either “*”, “**” or “#”, prepend a “\” character.
Replace all occurrences of “.” in the key with “\.”.
Replace all occurrences of “[” in the key with “\[”.
Replace all occurrences of “]” in the key with “\]”.
Here are some examples:
key | escaped key |
---|---|
A.B | A\.B |
A.B[5]C | A\.B\[5\]C |
* | \* |
** | \** |
# | \# |
\* | \\* |
Example¶
Here is an example of StructuredData (only the StructuredDataStore) formulated in YAML:
item1:
  first:
    - A
    - B
  second:
    - X
    - Y
  third:
    - m: 1
      n: 2
    - p: 10
      q: 11
If you are familiar with python, this would be the same structure in python:
{ "item1": { "first": ["A", "B"],
             "second": ["X", "Y"],
             "third": [{"m": 1, "n": 2}, {"p": 10, "q": 11}]
           }
}
In the example of StructuredData shown above the following table shows some examples of paths and the data they point to:
path | data (in python notation) |
---|---|
item1.first | ["A", "B"] |
item1.first[1] | "B" |
item1.second[0] | "X" |
item1.third | [{"m": 1, "n": 2}, {"p": 10, "q": 11}] |
item1.third[0] | {"m": 1, "n": 2} |
item1.third[0].m | 1 |
item1.third[0].n | 2 |
item1.third[1].q | 11 |
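In python terms, following a keylist from the top of the store is just repeated indexing. The sketch below assumes the store is held as nested dicts and lists; the function name `get_node` is hypothetical and not part of any documented API.

```python
def get_node(store, keylist):
    """Follow the keys one by one, starting at the top of the store."""
    node = store
    for key in keylist:
        node = node[key]  # str key for a map, int index for a list
    return node


store = {
    "item1": {
        "first": ["A", "B"],
        "second": ["X", "Y"],
        "third": [{"m": 1, "n": 2}, {"p": 10, "q": 11}],
    }
}

# The keylist ["item1", "third", 1, "q"] corresponds to the path item1.third[1].q
print(get_node(store, ["item1", "third", 1, "q"]))
```

An empty keylist returns the top node itself, which matches the statement that the top node has no path.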
Patterns¶
In order to select a subset from a set of paths we define patterns, also called path patterns where they could be confused with other types of patterns. Patterns combine special keys with ordinary keys, so each path can also be considered a pattern. These are the special keys that can be used in patterns:
key name | string representation | meaning |
---|---|---|
ANYKEY | * | matches any key |
ANYKEYS | ** | matches one or more keys of any value |
ROOTKEY | # | used in type patterns for the root type |
Patterns come in two flavours, type patterns and match patterns. For detailed information on type patterns see also StructuredDataTypes.
Here are the differences between both flavours:
flavour | allowed special keys | usage |
---|---|---|
type pattern | ROOTKEY, ANYKEY | type declarations |
match pattern | ANYKEY, ANYKEYS | matching paths |
Example¶
Here are some examples for match patterns:
Assume that we have the following set of paths:
item1
item1.first
item1.first.A
item1.first.B
item1.second
item1.second.X
item1.second.Y
item1.third
item1.third[0]
item1.third[1]
item1.third[0].m
item1.third[0].n
item1.third[1].p
item1.third[1].q
This is what some patterns match:
pattern | paths matched |
---|---|
* | item1 |
item1.* | item1.first, item1.second, item1.third |
item1.second.* | item1.second.X, item1.second.Y |
item1.*.* | item1.first.A, item1.first.B, item1.second.X, item1.second.Y, item1.third[0], item1.third[1] |
item1.third[1].* | item1.third[1].p, item1.third[1].q |
item1.third.** | item1.third[0], item1.third[1], item1.third[0].m, item1.third[0].n, item1.third[1].p, item1.third[1].q |
*.second.* | item1.second.X, item1.second.Y |
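The behaviour of match patterns in the table above can be reproduced with a short python sketch. It is a simplified illustration, not the actual matcher: the names `split_path` and `matches` are hypothetical, and the splitter assumes paths and patterns without escaped characters.

```python
import re


def split_path(path):
    """Split an unescaped path or pattern into key tokens, e.g. 'a[0].b' -> ['a', '[0]', 'b']."""
    return re.findall(r"\[\d+\]|[^.\[\]]+", path)


def _match(pattern_keys, path_keys):
    if not pattern_keys:
        return not path_keys
    head, rest = pattern_keys[0], pattern_keys[1:]
    if head == "**":
        # ANYKEYS consumes one or more keys
        return any(_match(rest, path_keys[n:]) for n in range(1, len(path_keys) + 1))
    if not path_keys:
        return False
    if head == "*" or head == path_keys[0]:
        # ANYKEY or a directly matching key consumes exactly one key
        return _match(rest, path_keys[1:])
    return False


def matches(pattern, path):
    return _match(split_path(pattern), split_path(path))


paths = [
    "item1", "item1.first", "item1.first.A", "item1.first.B",
    "item1.second", "item1.second.X", "item1.second.Y",
    "item1.third", "item1.third[0]", "item1.third[1]",
    "item1.third[0].m", "item1.third[0].n",
    "item1.third[1].p", "item1.third[1].q",
]

print([p for p in paths if matches("item1.third.**", p)])
```

Treating a list index such as “[0]” as one token keeps the sketch short; a full implementation would also have to undo the escape rules before comparing keys.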
StructuredDataStore¶
A StructuredDataStore basically is StructuredData without type declarations. A StructuredDataStore is often embedded in a StructuredDataContainer together with StructuredDataTypes.
StructuredDataTypes¶
The concept of paths allows any part of a StructuredDataStore to be referenced with a single string. The concept of patterns allows sets of paths, and thereby subsets of the StructuredDataStore, to be referenced. For an introduction to patterns see patterns. Here we use a special flavour of patterns called type patterns; for further details see type patterns.
A StructuredDataTypes structure maps patterns, which are strings, to type declarations, which are either strings or maps with a single key. StructuredDataTypes is therefore itself StructuredData.
We can now check whether the nodes of a StructuredDataStore are consistent with the type declarations in StructuredDataTypes. For each path in the StructuredDataStore we look for a matching pattern in StructuredDataTypes. If more than one pattern matches, the “best” matching pattern is selected. See also matching typepatterns for details.
If a pattern is found, the corresponding type declaration is checked with the node referenced by the path. We report an error for each path where the type declaration didn’t match.
Differences to programming language type declarations¶
In statically typed programming languages without type inference you have to declare types for all variables and parameters and functions. With StructuredData you can define types partially. It is possible to have no type declarations for parts of the data.
Typepatterns¶
Typepatterns are a flavour of patterns that are used for type declarations. The wildcard “**” (ANYKEYS) is not allowed here. The special path “#” (ROOTKEY) is used to declare the type of the top node since the top node has no path.
Here are some examples of typepatterns:
pattern | comment |
---|---|
# | matches the top node |
* | matches all elements of the top node |
A | matches element “A” of the top node |
A.B | matches element “B” of element “A” of the top node |
Typepattern matching¶
During a typecheck the program checks, for each path, whether there is a matching typepattern in StructuredDataTypes. In order to speed up this process, not all typepatterns are examined but only those that have the same length as the path. For this reason “**” is not allowed in typepatterns, since it would also match longer paths. The details of the typepattern matching algorithm are important if more than one typepattern would match the path. The algorithm determines which of the matching typepatterns is selected for the actual typecheck.
At each stage a directly matching key in a typepattern has precedence over a wildcard. If a matching typepattern is found, the other typepatterns are not searched.
Here are some examples with a path, some typepatterns and an indicator which typepattern is found by the match algorithm:
path | typepatterns | matched |
---|---|---|
X.B.D | *.*.D | X |
 | *.B.C | |
 | X.A.* | |
X.B.D | X.B.* | |
 | X.B.D | X |
X.B.D | X.*.* | X |
 | *.B.D | |
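The selection rule can be sketched in python as a search that, at each stage, first tries typepatterns with a directly matching key and only then those with a wildcard. This is an assumption-laden illustration rather than the actual algorithm: the function name `find_typepattern` is hypothetical, keys are given as tuples instead of path strings, and the sketch assumes that the search falls back to the wildcard branch when the direct branch leads to no match.

```python
def find_typepattern(path_keys, typepatterns):
    """Return the selected typepattern for a path, or None if nothing matches."""
    # Only typepatterns with the same length as the path are considered.
    candidates = [p for p in typepatterns if len(p) == len(path_keys)]
    return _search(path_keys, candidates, 0)


def _search(path_keys, candidates, stage):
    if not candidates:
        return None
    if stage == len(path_keys):
        return candidates[0]  # first match wins, the rest is not searched
    # A directly matching key has precedence over the "*" wildcard.
    for wanted in (path_keys[stage], "*"):
        subset = [p for p in candidates if p[stage] == wanted]
        found = _search(path_keys, subset, stage + 1)
        if found is not None:
            return found
    return None


path = ("X", "B", "D")
print(find_typepattern(path, [("*", "*", "D"), ("*", "B", "C"), ("X", "A", "*")]))
print(find_typepattern(path, [("X", "B", "*"), ("X", "B", "D")]))
print(find_typepattern(path, [("X", "*", "*"), ("*", "B", "D")]))
```

The three calls correspond to the three examples in the table above. In the last one, “X.*.*” is selected because its first key matches directly, even though “*.B.D” has more directly matching keys overall.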
Type declarations¶
This is the list of currently known type declarations, note that we write the type declaration in YAML syntax here:
boolean¶
A boolean. A scalar of type boolean has only two possible values, True or False. Note that in SDpyshell these two values are written True and False. In YAML, however, the values are true and false (all lower case). This data type is represented with the string:
boolean
integer¶
An integer number. Note that the range of these numbers is not defined here. We assume that the range is at least -2**31 to +2**31. This data type is represented with the string:
integer
real¶
A floating point number. We assume floating point numbers according to the IEEE 754 standard.
This data type is represented with the string:
real
string¶
A sequence of characters. Unicode characters are supported.
This data type is represented with the string:
string
optional struct¶
This is a map where all map keys must be elements of the list provided in the type declaration.
This data type is represented as a map with just one key and a list as value. Here is the representation of it in YAML, there can be an arbitrary number of map keys:
optional_struct:
- map_key1
- map_key2
open struct¶
This is a map where all elements of the list provided in the type declaration must be present as map keys. The map may, however, have additional keys.
This data type is represented as a map with just one key and a list as value. Here is the representation of it in YAML, there can be an arbitrary number of map keys:
open_struct:
- map_key1
- map_key2
struct¶
This is a map where all elements of the list provided in the type declaration must be present as map keys. No other keys are allowed in the map than the elements of the list.
This data type is represented as a map with just one key and a list as value. Here is the representation of it in YAML, there can be an arbitrary number of map keys:
struct:
- map_key1
- map_key2
typed map¶
This is a map where each value must be of the type scalar_type. scalar_type is either “boolean”, “integer”, “real” or “string”.
This data type is represented as a map with just one key and a string as value. The value must be one of the strings “boolean”, “integer”, “real” or “string”. Here is a representation in YAML which requires that all map values must be integers:
typed_map: integer
map¶
This is a map with no further restrictions (aside from that map keys must be strings).
This data type is represented with the string:
map
optional list¶
This is a list where all list elements must be elements of the list provided in the type declaration.
This data type is represented as a map with just one key and a list as value. Here is the representation of it in YAML, there can be an arbitrary number of values:
optional_list:
- value1
- value2
typed list¶
This is a list where each value must be of the type scalar_type. scalar_type is either “boolean”, “integer”, “real” or “string”.
This data type is represented as a map with just one key and a string as value. The value must be one of the strings “boolean”, “integer”, “real” or “string”. Here is a representation in YAML which requires that all list elements must be integers:
typed_list: integer
list¶
This is simply a list with no further restrictions.
This data type is represented with the string:
list
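The type declarations listed above can be summarized in a small python checker. This is a minimal sketch under the mapping to python data types given earlier, not the library's typecheck: the name `check_type` is hypothetical, declarations are assumed to be already loaded from YAML into python strings and dicts, and the sketch assumes that a real must be a python float (an int is not accepted as a real).

```python
SCALAR_CHECKS = {
    "boolean": lambda v: isinstance(v, bool),
    # bool is a subclass of int in python, so it is excluded explicitly
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "real": lambda v: isinstance(v, float),  # assumption: ints are not reals
    "string": lambda v: isinstance(v, str),
}


def check_type(node, decl):
    """Return True if node conforms to the type declaration decl."""
    if isinstance(decl, str):
        if decl in SCALAR_CHECKS:
            return SCALAR_CHECKS[decl](node)
        if decl == "map":
            return isinstance(node, dict)
        if decl == "list":
            return isinstance(node, list)
        raise ValueError("unknown type declaration: %r" % decl)
    # All other declarations are maps with exactly one key.
    ((name, arg),) = decl.items()
    if name in ("optional_struct", "open_struct", "struct"):
        if not isinstance(node, dict):
            return False
        keys, declared = set(node), set(arg)
        if name == "optional_struct":
            return keys <= declared      # no keys other than the declared ones
        if name == "open_struct":
            return declared <= keys      # all declared keys present, extras allowed
        return keys == declared          # struct: exactly the declared keys
    if name == "typed_map":
        return isinstance(node, dict) and all(SCALAR_CHECKS[arg](v) for v in node.values())
    if name == "optional_list":
        return isinstance(node, list) and all(v in arg for v in node)
    if name == "typed_list":
        return isinstance(node, list) and all(SCALAR_CHECKS[arg](v) for v in node)
    raise ValueError("unknown type declaration: %r" % name)


print(check_type({"A": "x", "B": "y"}, {"optional_struct": ["A", "B", "C"]}))
print(check_type({"A": "x"}, {"struct": ["A", "B"]}))
```

The three struct variants differ only in the set comparison between the keys present in the map and the keys listed in the declaration.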
StructuredDataContainer¶
A StructuredDataContainer contains a StructuredDataStore and optionally StructuredDataTypes. When a StructuredDataContainer is stored in a file, it is stored in YAML format. Here is an example of what such a file looks like:
'**SDC-Metadata**':
  version: '1.0'
'**SDC-Store**':
  key1: 1
  key2:
    A: x
    B: y
  key3:
    - 1
    - 2
    - 3
    - float: 1.23
'**SDC-Types**':
  '#':
    struct:
      - key1
      - key2
      - key3
  '*.key1': integer
  '*.key2':
    optional_struct:
      - A
      - B
      - C
  '*.key2.*': string
  '*.key3':
    typed_list: integer
A StructuredDataContainer consists of three parts, the metadata, the StructuredDataStore and the StructuredDataTypes.
- metadata
This is meta information on the file. Currently it only contains the version number of the file format. It is everything below the key “**SDC-Metadata**”.
- StructuredDataStore
This is the part of the file where the data is stored. It is everything below the key “**SDC-Store**”.
- StructuredDataTypes
Here are the type declarations. It is everything below the key “**SDC-Types**”. Type declarations are explained in more detail in the section on StructuredDataTypes. For now we just remember that type declarations consist of type patterns and types. A type pattern is a string that identifies a position in the store. The “#” is the root symbol; it is used to define the type for the topmost part of the StructuredDataStore. The “*” characters are wildcards, similar to the “*” used in file systems; they match any string at that position. Note that the store and the types may reside in two different files.