I am trying to retrieve data from a large file that is stored in the repository. It is a text file and I am trying to read it as follows,
with folder.open('name.dat') as fh:
    data = parse(fh)
If I read the file like this, my JupyterLab notebook consumes a lot of memory, so much so that if I try to read more than 1 or 2 of these files the notebook crashes. I also tried an alternative approach where a temporary copy of the file gets stored to disk:
with folder.as_path('name.dat') as path:
    data = parse(path)
With this second approach the notebook only uses a few MB of memory. For some reason the first approach seems to load the entire file into memory. I find this surprising, since the documentation seems to explicitly mention that the open function gives stream-like access to the file contents and should be preferred over other methods such as get_object_content or as_path for large files (Data types — AiiDA 2.6.3 documentation). Does anyone know what might be going wrong here? Is the problem that I am using the open method in text mode? Does that automatically read the entire file into memory? Or should I check whether the parse routine I am using reads the entire file into memory when it gets passed a file handle instead of a filename?
Yeah, I’d suggest posting the parse routine here.
Note that the documentation doesn’t really mention the memory usage of as_path (unless I missed it); it just mentions that it’s inefficient, since the file needs to be copied to a temporary folder first. So I don’t think the documentation is wrong, but it might be incomplete.
I think @ahkole rather meant that the docs suggest that using open should be memory efficient. In principle it should be.
Before looking at parse in detail, did you try to simply call open and then do nothing with it? Just to see if merely opening the file somehow causes the content to be read into memory. That would also pinpoint whether open or parse is the culprit.
This was a very good suggestion. I tried running this,
with folder.open('aiida.PDOS.xml') as fh:
    print(type(fh))
and my memory usage for the Jupyter kernel went from 0.5% to 30% (an increase of around 4-5 GB). So it really does seem to be caused by the use of open. The type of fh is io.StringIO, which also indicates that the entire contents get buffered as a string in memory, right?
Interestingly, if I instead use,
with folder.open('aiida.PDOS.xml', 'rb') as fh:
    print(type(fh))
then the resulting type is <class 'disk_objectstore.utils.PackedObjectReader'> and the memory usage hardly increases. So in binary mode it does seem to give a stream that doesn’t immediately read the entire content into memory. Is there a way to wrap this binary stream so that I can still access the contents as a text stream?
Looking at the code, it seems that the “parse” function here is either defusedxml.iterparse or xml.etree.ElementTree.iterparse. I don’t know how they are implemented, but are you sure that they can parse XML in a streaming manner? XML is not really a line-based format. For example, to find the closing of the first tag, you may have to read to the end of the file. Presumably they therefore read the entire file into memory. But maybe they do something clever.
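(For what it’s worth, iterparse is documented to parse incrementally; the usual memory-friendly pattern clears each element once it is complete, so the tree does not accumulate. A rough, untested sketch, where the tag name 'orbital' is made up:)
import xml.etree.ElementTree as ET

with folder.open('aiida.PDOS.xml', 'rb') as fh:
    for event, elem in ET.iterparse(fh, events=('end',)):
        if elem.tag == 'orbital':  # 'orbital' is a made-up tag name
            ...  # process the element here
            elem.clear()  # free the element's children to keep memory flat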
What puzzles me still is the difference in behavior you report between open and as_path. The latter simply copies the file from AiiDA’s file repository to a temporary file on disk and then provides the filepath. If you pass this to the XML parser, it will most likely simply open the file using Python’s built-in open. One possibility is that when the XML lib is passed one of AiiDA’s file handles, it calls a method of the IO interface that somehow causes too much data to be read, but I have no idea which method that could be.
Anyway, I still think that the best thing to test first is to just make sure that simply calling FolderData.open doesn’t actually read everything into memory. So do something like:
with folder.open('aiida.PDOS.xml') as fh:
    fh.seek(1)  # Some innocuous operation to make sure we use the handle without reading much data.
If that doesn’t spike the memory usage, it should be the XML parser.
Additionally, you could try reading the file line by line while monitoring memory, to make sure that works as expected:
with folder.open('aiida.PDOS.xml') as fh:
    while line := fh.readline():
        print(line)
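To actually watch the memory while doing this, something along these lines could work (a sketch; assumes the third-party psutil package is installed):
import psutil

proc = psutil.Process()  # current process
print(f'RSS before: {proc.memory_info().rss / 1e6:.1f} MB')
with folder.open('aiida.PDOS.xml') as fh:
    while line := fh.readline():
        pass  # discard the line; we only care about the memory footprint
print(f'RSS after: {proc.memory_info().rss / 1e6:.1f} MB')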
Also, what storage plugin are you using? Please report the output of verdi status.
Ok, it seems you have already answered my questions, thanks for the additional info. That now makes the problem clear. The issue is the following:
All data in AiiDA’s file repository is stored as bytes. So when you open a file in text mode, the content needs to be decoded to text. Unfortunately, the current implementation indeed reads the entire content into memory and wraps it in an io.StringIO buffer.
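In other words, text mode currently boils down to something like the following (a paraphrase, not the literal AiiDA source):
import io

# Paraphrased sketch: the entire (possibly multi-GB) file is read and decoded
# in one go, and the resulting string is held in an in-memory buffer.
with folder.open('aiida.PDOS.xml', 'rb') as binary_handle:
    fh = io.StringIO(binary_handle.read().decode('utf-8'))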
Instinctively, I think we could add a wrapper layer that implements the IO interface and only reads the content in chunks as methods like read and readline are called, returning it after decoding. I am not sure if there might be limitations with some of the underlying object store implementations, though, nor whether the current implementation was done intentionally or is just a mistake.
Ah okay, thank you for checking. Is there any way that I could wrap the binary handle myself? Either with something from the Python standard library or with a simple class that implements the TextIOBase interface? What kind of object is the binary stream that AiiDA returns in binary mode?
Think the most important methods to implement are read, seek, tell, readline, readlines. But it depends on what your library calls.
You would use it as
with StreamDecoder(folder.open("file.ext", mode="rb")) as buffer:
    assert isinstance(buffer.readline(), str)
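A minimal sketch of what such a StreamDecoder could look like (untested; assumes the wrapped byte stream supports these calls and that the content is UTF-8):
class StreamDecoder:
    """Decode a binary stream to text on the fly.

    Untested sketch. Note that read(size) can split a multi-byte character;
    a real implementation would use codecs.getincrementaldecoder for that.
    """

    def __init__(self, raw, encoding='utf-8'):
        self._raw = raw
        self._encoding = encoding

    def read(self, size=-1):
        return self._raw.read(size).decode(self._encoding)

    def readline(self):
        return self._raw.readline().decode(self._encoding)

    def readlines(self):
        return [line.decode(self._encoding) for line in self._raw.readlines()]

    def seek(self, offset, whence=0):
        return self._raw.seek(offset, whence)

    def tell(self):
        return self._raw.tell()

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self._raw.close()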
I may have found a way of doing it without implementing a class myself, since the Python standard library provides a TextIOWrapper class (io — Core tools for working with streams — Python 3.13.1 documentation). However, I could not immediately pass the AiiDA binary stream to it, because it doesn’t implement a few methods that TextIOWrapper expects. The AiiDA stream does have an internal _fhandle attribute that does seem to have these methods, so I am now exposing them before passing it to TextIOWrapper, as follows:
import sisl
from io import TextIOWrapper

with folder.open('aiida.PDOS.xml', 'rb') as fh:
    # Expose the attributes that TextIOWrapper expects but that the AiiDA
    # handle does not implement, by borrowing them from the underlying handle.
    fh.readable = fh._fhandle.readable
    fh.writable = fh._fhandle.writable
    fh.closed = fh._fhandle.closed
    fh.flush = fh._fhandle.flush
    fh_text = TextIOWrapper(fh)
    pdos_data = sisl.get_sile_class('PDOS.xml')(fh_text).read_data()
This seems to work and successfully parses the data from the file without a spike in memory. Nevertheless, I wanted to check whether I’m not doing something dangerous or stupid here. Is it safe to use the _fhandle attribute like this? Is there a reason why these methods that TextIOWrapper expects are not directly exposed by the file handle that AiiDA returns?
From what I have tested so far I think this could work if the handle class gets modified to also expose these missing functions that TextIOWrapper expects. Should I try to do this and create a PR for that?
Is it enough to add them to the LazyOpener? In my particular case I actually have a PackedObjectReader (probably because the file in question was packed into a pack file during a regular verdi storage maintain?).
Hi, I was (finally) able to find some time to try to fix this and I think I got it working. Before creating a PR, I wanted to ask a question about a design choice. Right now I’m “proxying” all the missing attributes expected by TextIOWrapper by defining a __getattr__ on LazyOpener and PackedObjectReader, like so:
def __getattr__(self, attr) -> Any:
    """For attributes not explicitly defined, return the attribute of the underlying filehandle."""
    return getattr(self._fhandle, attr)
with an additional check for LazyOpener that the filehandle exists (see the sketch below). Do you think this proxying makes sense for LazyOpener and PackedObjectReader? Or should I only proxy those attributes that TextIOWrapper expects and that are still missing?
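For reference, the guarded variant for LazyOpener would look roughly like this (a sketch, assuming LazyOpener also stores the open handle in _fhandle and that it is None until the file is opened):
def __getattr__(self, attr) -> Any:
    """Proxy attributes of the underlying filehandle, once it exists."""
    if self._fhandle is None:
        # The file has not been opened yet, so there is nothing to proxy.
        raise AttributeError(attr)
    return getattr(self._fhandle, attr)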