Memory usage of FolderData.open

I am trying to retrieve data from a large file that is stored in the repository. It is a text file and I am trying to read it as follows,

with folder.open('name.dat') as fh:
    data = parse(fh)

If I read the file like this, my JupyterLab notebook consumes a lot of memory. So much that if I try to read more than one or two of these files the notebook crashes. I also tried an alternative approach where a temporary copy of the file gets stored to disk:

with folder.as_path('name.dat') as filepath:
    data = parse(filepath)

With this second approach the notebook only uses a few MB of memory. For some reason it seems like the first approach loads the entire file into memory. I find this surprising, since the documentation seems to explicitly mention that the open function gives stream-like access to the file contents and should be preferred over other methods such as get_object_content or as_path for large files (Data types — AiiDA 2.6.3 documentation). Does anyone know what might be going wrong here? Is the problem that I am using the open method in text mode? Does that automatically read the entire file into memory? Or should I look into whether the parse routine that I am using for some reason reads the entire file into memory when it gets passed a file handle instead of a filename?

Yeah, I’d suggest to post the parse routine here.

Note that the documentation doesn’t really mention the memory usage of as_path (unless I missed it), it just mentions that it’s inefficient since the files need to be copied to a temporary folder first. So I don’t think the documentation is wrong, but it might be incomplete.

See here a link to the parsing routine: sisl/src/sisl/io/siesta/pdos.py at 7fbb29f21b05b9f6b01ba51e6ff9f22a20306f0a · ahkole/sisl · GitHub

The routine imports a lot of other stuff from the project so I thought posting a link would be easier than trying to paste all the code here.

I call the routine as follows,

with folder.open('aiida.PDOS.xml') as fh:
    data = sisl.get_sile_class('PDOS.xml')(fh).read_data()

This should make sure that self.fh in the routine is initialized with the file handle I get from AiiDA.

I think @ahkole meant rather that the docs suggest that using open should be memory efficient. In principle it should be.

Before looking at parse in detail, did you try to simply call open and then do nothing with it? Just to see if merely opening the file somehow causes the content to be read into memory? That would also pinpoint whether open or parse is the culprit.


Before looking at parse in detail, did you try to simply call open and then do nothing with it? Just to see if merely opening the file somehow causes the content to be read into memory? That would also pinpoint whether open or parse is the culprit.

This was a very good suggestion. I tried running this,

with folder.open('aiida.PDOS.xml') as fh:
    print(type(fh))

and my memory usage went from 0.5% to 30% for the Jupyter kernel (an increase of around 4-5 GB). So it does really seem to be caused by the usage of open. The type of fh is also io.StringIO, which also indicates that the entire contents get buffered as a string in memory, right?
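(For completeness, the kernel’s memory footprint can also be checked from inside the notebook, for example with psutil; that is an extra dependency and just one way to get these numbers:)

import os

import psutil

kernel = psutil.Process(os.getpid())  # the running Jupyter kernel process
print(f'{kernel.memory_info().rss / 1024**3:.2f} GiB resident '
      f'({kernel.memory_percent():.1f}% of system memory)')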

Interestingly, if I instead use,

with folder.open('aiida.PDOS.xml', 'rb') as fh:
    print(type(fh))

then the resulting type is <class 'disk_objectstore.utils.PackedObjectReader'> and the memory usage hardly increases. So in binary mode it does seem to give a stream that doesn’t immediately read the entire content into memory. Is there a way to wrap this binary stream so that I can still access the contents as a text stream?


Looking at the code, it seems that the “parse” function here is either defusedxml.iterparse or xml.etree.ElementTree.iterparse. I don’t know how they are implemented, but are you sure that they can parse XML in a streaming manner? XML is not really a line-based format. For example, to find the closing of the first tag, you would have to read to the end of the file. Presumably they therefore read the entire file into memory. But maybe they do something clever.
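For what it’s worth, iterparse itself can be driven incrementally, but memory only stays bounded if the already-parsed elements are cleared as you go; a minimal sketch (the tag name is just an illustration, I don’t know the actual layout of PDOS.xml):

from xml.etree.ElementTree import iterparse

with folder.open('aiida.PDOS.xml', 'rb') as fh:
    for event, elem in iterparse(fh, events=('end',)):
        if elem.tag == 'orbital':  # illustrative tag, not necessarily present in the file
            ...  # process the element here
        elem.clear()  # drop parsed content so the in-memory tree stays small

So even if iterparse is used, whether memory stays low depends on how the parsing routine drives it.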

What puzzles me still is the difference in behavior you report when using open or as_path. The latter simply copies the file from AiiDA’s file repository to a temporary file on disk and then provides the filepath. If you pass this to the XML parser, it will most likely simply open the file using Python’s built-in open. One possibility is that when the XML lib is passed one of AiiDA’s file handles, it calls a method of the IO interface that somehow causes too much data to be read, but I have no idea which one this could be.

Anyway, I still think that the best thing to test first is to just make sure that simply calling FolderData.open doesn’t actually read everything into memory. So do something like:

with folder.open('aiida.PDOS.xml') as fh:
    fh.seek(1)  # Some innocuous operation to make sure we use it but not read too much data.

If that doesn’t spike the memory usage, it should be the XML parser.

Additionally, you could try reading the file line by line and monitoring memory, to make sure that works as expected:

with folder.open('aiida.PDOS.xml') as fh:
    while line := fh.readline():
        print(line)

Also, what storage plugin are you using? Please report the output of verdi status.

Ok, seems you have already answered my questions :smiley: thanks for the additional info. That now makes the problem clear. The issue is here:

All data in AiiDA’s file repository is stored as bytes. So when you open it in text mode, it needs to decode the content to normal text. Unfortunately, the current implementation indeed reads the entire content into memory and wraps it in an io.StringIO buffer.

Instinctively, I think we could add a wrapper layer that implements the IO interface and only reads the content in chunks as methods like read and readline are called, returning it after decoding. I am not sure if there might be limitations with some of the underlying object store implementations, though. I am also not sure whether the current implementation was done intentionally or is just an oversight.


Ah okay, thank you for checking. Is there any way that I could wrap the binary handle that AiiDA gives me in binary mode myself? Either with something from the Python standard library or with a simple class that implements the TextIOBase interface? What kind of object is the binary stream that AiiDA returns in binary mode?

In binary mode, it should return an object that implements the IO interface (see io — Core tools for working with streams — Python 3.13.1 documentation). So you could write a simple wrapper yourself. Something like:

import os


class StreamDecoder:
    """Wrap a binary stream and decode its contents to text on the fly."""

    def __init__(self, buffer):
        self.buffer = buffer

    def read(self, size=-1, /):
        return self.buffer.read(size).decode("utf-8")

    def readline(self, size=-1, /):
        # Assumes the underlying binary handle provides ``readline``.
        return self.buffer.readline(size).decode("utf-8")

    def seek(self, offset, whence=os.SEEK_SET, /):
        return self.buffer.seek(offset, whence)

    # etc.

I think the most important methods to implement are read, seek, tell, readline and readlines, but it depends on what your library calls.
You would use it as

with folder.open("file.ext", mode="rb") as handle:
    buffer = StreamDecoder(handle)
    assert isinstance(buffer.readline(), str)

Hope that helps


Thanks! I’ll give it a try.

Good catch! Indeed we need to fix this.

I think the correct way is to fix our AiiDA code from

yield io.StringIO(handle.read().decode('utf-8'))

to simply use a TextIOWrapper?

I guess yield io.TextIOWrapper(handle, encoding='utf-8') but I didn’t test it.
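In context, the change would look roughly like this (an untested sketch with illustrative names, not the actual aiida-core source):

import io
from contextlib import contextmanager


@contextmanager
def open_text(repository, path, encoding='utf-8'):
    """Yield a text stream that decodes the binary repository object lazily."""
    with repository.open(path, mode='rb') as handle:
        # Instead of io.StringIO(handle.read().decode(encoding)), which reads
        # everything up front, wrap the handle so decoding happens on read.
        yield io.TextIOWrapper(handle, encoding=encoding)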

Actually, if you can try this out (in AiiDA, or outside of it in your own code, to see if it works), then we can make a PR to fix this.

I may have found a way of doing it without implementing a class myself, since the Python standard library also provides a TextIOWrapper class (io — Core tools for working with streams — Python 3.13.1 documentation). However, I could not immediately pass the AiiDA binary stream to it, because the stream doesn’t implement a few methods that TextIOWrapper expects. The AiiDA stream does have an internal _fhandle attribute that does seem to have these methods, so I am now exposing them before passing the stream to TextIOWrapper, as follows:

from io import TextIOWrapper

with folder.open('aiida.PDOS.xml', 'rb') as fh:
    fh.readable = fh._fhandle.readable
    fh.writable = fh._fhandle.writable
    fh.closed = fh._fhandle.closed
    fh.flush = fh._fhandle.flush
    fh_text = TextIOWrapper(fh)
    pdos_data = sisl.get_sile_class('PDOS.xml')(fh_text).read_data()

This seems to work and successfully parses the data from the file without a spike in memory. Nevertheless, I wanted to check whether I’m doing something dangerous or stupid here. Is it safe to use the _fhandle attribute like this? Is there a reason why these methods that TextIOWrapper expects are not directly exposed by the file handle that AiiDA returns?

From what I have tested so far, I think this could work if the handle class gets modified to also expose these missing methods that TextIOWrapper expects. Should I try to do this and create a PR for that?

I think the relevant class that would have to be updated is here: disk-objectstore/disk_objectstore/utils.py at 73f14e3953c276268b7f7395ad385e693798e247 · aiidateam/disk-objectstore · GitHub? That seems to be a separate project, so PRs to two different projects would then be needed.

Maybe one idea would be to pass on all attribute lookups that are not directly implemented by the class to the underlying file-handle? They do something similar in sisl: sisl/src/sisl/io/sile.py at abce28681dc40f99371d932738820380691b6fa4 · zerothi/sisl · GitHub

Oh we cross-posted :slight_smile:
I think the issue is that the disk-objectstore LazyOpener does not “proxy” those attributes. We should indeed do it - I think it’s for now OK for you to do it, but I would add the methods to disk-objectstore/disk_objectstore/utils.py at 73f14e3953c276268b7f7395ad385e693798e247 · aiidateam/disk-objectstore · GitHub
And hopefully this fixes the problem. Anyone willing to take care of coordinating this set of PRs?

Yes, sorry. I have been adding more information to my previous replies a bit erratically :sweat_smile:

I think the issue is that the disk-objectstore LazyOpener does not “proxy” those attributes. We should indeed do it - I think it’s for now OK for you to do it, but I would add the methods to disk-objectstore/disk_objectstore/utils.py at 73f14e3953c276268b7f7395ad385e693798e247 · aiidateam/disk-objectstore · GitHub

Is it enough to add them to the LazyOpener? For my particular case I actually have a PackedObjectReader (probably because the file in question was packed into a pack file during a regular storage maintain?).

Thanks for the additional info - yes, probably we need to proxy those methods in all relevant classes in the utils.py file of disk-objectstore

@ahkole, thanks for posting this issue and also finding the solution!
I think a more general solution is now clear.

Would you like to open two PRs on aiida-core and disk_objectstore yourself? In that case we can review them :slight_smile:

Hi @ali-khosravi . Yes, I would like to give this a try and open two PRs on aiida-core and disk_objectstore in the coming days.


Perfect! Looking forward to it :slight_smile:

Hi, I was (finally) able to find some time to try to fix this and I think I got it working. Before creating a PR I wanted to ask a question about a design choice. Right now I’m “proxying” all the missing attributes expected by TextIOWrapper by defining a __getattr__ for LazyOpener and PackedObjectReader, like so:

def __getattr__(self, attr) -> Any:
    """For attributes not explicitly defined, return attributes
    of underlying filehandle.
    """
    return getattr(self._fhandle, attr)

with an additional check for LazyOpener that the filehandle exists. Do you think this proxying makes sense for LazyOpener and PackedObjectReader? Or should I only proxy those attributes that are expected by TextIOWrapper and still missing?
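For concreteness, the LazyOpener variant with the existence check currently looks roughly like this in my draft (the exact wording of the error message is not final):

def __getattr__(self, attr) -> Any:
    """For attributes not explicitly defined, return the attribute of the
    underlying filehandle, provided the file has already been opened."""
    if self._fhandle is None:
        raise AttributeError(f"'{type(self).__name__}' has no attribute '{attr}' (file not yet opened)")
    return getattr(self._fhandle, attr)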