I’m trying to get a sense of how the “contains” query works for list/dictionary types. Let’s focus on lists first. Suppose the following:
orm.QueryBuilder().append(
orm.StructureData,
filters={
"attributes.cell": {
"contains": [[1.0, 0.0, 1.0]],
}
}
)
I would think this means that the cell contains the vector [1.0, 0.0, 1.0]
. However, the query fetches a result with a cell [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
. This suggests it’s doing some sort of partial check through each element? Unclear.
Also, the documentation itself is a bit ambiguous.
Contains all? Any? Note that neither seems to explain the above behavior.
A bit of digging suggests that “contains” points to SQLAlchemy’s sqlalchemy.dialects.postgresql.JSONB.Comparator.contains
(I think). If so, the docs say
“Boolean expression. Test if keys (or array) are a superset of/contained the keys of the argument jsonb expression.”
Not sure what to make of it.
We need to check the actual query in postgres.
You can store the query builder object e.g. in qb
and then run print(qb.as_sql(inline=True))
. And then we need to check the PostgreSQL docs.
From some quick testing, I think that:
{'contains': ['a', 'b']}
means that it should match if there are both elements in the list
- if you put one more list, I tried with this example:
from aiida import orm
n = orm.Dict({'test': [['a', 'b', 'c'], ['a', 'd', 'f']]})
n.store()
Then, I tried various queries similar to print(orm.QueryBuilder().append(orm.Dict, filters={"attributes.test": {"contains": [["f"], ["b"]]}}).all(flat=True))
.
I think that:
"contains": [["a", "d"]]
means that there is at least an element in the list with both elements. This matches (there is ['a', 'd', 'f']
)
- So,
"contains": [["f"]]
means at least one element in the list is a list with the element f.
"contains": [["f"], ["b"]]
then means that both conditions apply: there is at least one element in the list with f
, and at least one (the same or another) element in the same list with the element “b”. This matches, again.
You can try with adding other letters to see what matches and what not. And then we should double check Postres to confirm our intuition. But this should explain your results: "contains": [[1.0, 0.0, 1.0]]
means:
In the list of lists, there is at least one internal list that matches “[1.0, 0.0, 1.0]”, and the latter statement means: at least one internal list has both elements 0.0 and 1.0.
(which is not the query you had in mind, but can be at least explained).
As a final note, let’s be very careful of querying float numbers, because of the finite precision (I’m not sure of how “=” compares two float numbers in postgres).
(And note that in the SQLite backend, contains
is not supported).
Thanks @giovannipizzi. That’s very peculiar behavior. Is this the behavior we wish to reflect in the QueryBuilder? As for floats, that’s interesting. I think I’ll take a deeper look at this. Do we presently not handle floats comparisons?
I agree that is not what one would expect for the cell. But for general lists of lists, I see how the current behaviour covers at least some usecases. Anyway, I think we’ll be limited by the actual query capabilities of PostgreSQL, so we need first to check what is possible in raw Postgres SQL syntax.
If one really cares, it’s still possible to make a relatively complex query to achieve the goal you want, by making enough and/or filters. But again, do we really need it?
I don’t think you ever want to query for the first vector being exactly [1, 0, 0].
You might care if the system is cubic, for instance. And then you could do e.g. as sequence of and statements, checking that the 6 off-diagonal components are zero. But in this case I think it’s best to add extras e.g. with the spacegroup, and then query for those.
Float comparisons are OK and should work as intended. It’s equality that is tricky (not only for us, but for anything on a computer with float numbers).
Just try this in you python…
(4/3 - 1) == 1/3
(it should be True, but it will return False, or at least it does for me, as they differ by ~1e.-17).
You typically want to check if the difference in abs value is < some threshold (e.g. 1.e-6 or smaller, depending on the application).