Shuffle method on custom datastore written for a single binary file
I am writing a custom datastore and am seeking some assistance. My datasets consist of stacks of 2D images (frames) stored sequentially in a single binary file. While it’s very straightforward to read in the binary stream using fread, each full dataset can easily be on the order of 50+ GB, which makes it infeasible to load everything at once on the hardware I have available. This was my original motivation for exploring the use of a datastore.
In addition to managing out-of-memory data, I would also like to partition the data into chunks, where each chunk contains a random collection of frames from this binary file. If possible, I would like to use the shuffle method of the datastore superclass to accomplish this, as this seems to be the "proper" approach (although I’m very open to alternatives).
The problem I am currently having is that the default datastore shuffle method appears only to randomize the order of the files in a datastore directory. Since I only have one (very large) binary file, it doesn’t seem to "shuffle" anything at all: running readall on the shuffled datastore returns exactly the same data as running it on the original datastore. What I need instead is for it to "shuffle" the frames within the binary file. Presumably, if I saved each frame as an individual image file on disk, I could get this to work using imageDatastore or fileDatastore. However, I would then have to go through all my datasets and write every frame to disk again as a separate file, which seems rather silly.
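To make the goal concrete, here is a rough sketch of what frame-level shuffling could look like, assuming the custom datastore derives from matlab.io.Datastore and the matlab.io.datastore.Shuffleable mixin. The class name FrameStackDatastore, its properties, the uint16 pixel type, and the header-free file layout are all assumptions made for illustration, not something I have settled on. The idea is to shuffle a list of frame indices rather than a list of files.

```matlab
% Sketch only. FrameStackDatastore and its properties are hypothetical names;
% uint16 pixels and a file with no header are assumptions.
classdef FrameStackDatastore < matlab.io.Datastore & ...
                               matlab.io.datastore.Shuffleable
    properties
        Filename        % path to the single binary file
        FrameSize       % [rows cols] of one frame
        FrameOrder      % frame indices in the order they will be read
        NumRead = 0     % number of frames read since the last reset
    end

    methods
        function ds = FrameStackDatastore(filename, frameSize)
            ds.Filename  = filename;
            ds.FrameSize = frameSize;
            fileInfo = dir(filename);
            bytesPerFrame = prod(frameSize) * 2;   % 2 bytes per uint16 pixel
            ds.FrameOrder = 1:floor(fileInfo.bytes / bytesPerFrame);
        end

        function tf = hasdata(ds)
            tf = ds.NumRead < numel(ds.FrameOrder);
        end

        function [frame, info] = read(ds)
            assert(hasdata(ds), 'No more frames to read.');
            ds.NumRead = ds.NumRead + 1;
            info.FrameIndex = ds.FrameOrder(ds.NumRead);
            % Reopening the file on every read keeps the sketch simple;
            % it is not meant to be efficient.
            fid = fopen(ds.Filename, 'r');
            assert(fid ~= -1, 'Could not open %s', ds.Filename);
            cleanup = onCleanup(@() fclose(fid));
            offset = (info.FrameIndex - 1) * prod(ds.FrameSize) * 2;
            fseek(fid, offset, 'bof');             % jump to the start of the frame
            frame = fread(fid, ds.FrameSize, 'uint16=>uint16');
        end

        function reset(ds)
            ds.NumRead = 0;
        end

        function dsNew = shuffle(ds)
            % Permute the frame indices instead of the (single) file.
            dsNew = copy(ds);
            dsNew.FrameOrder = ds.FrameOrder(randperm(numel(ds.FrameOrder)));
            reset(dsNew);
        end
    end

    methods (Hidden = true)
        function frac = progress(ds)
            frac = ds.NumRead / max(numel(ds.FrameOrder), 1);
        end
    end
end
```

With something like this, shuffle(ds) would return a copy whose read calls walk the frames in random order. Each read here returns a single frame, so a random chunk would still need several reads (or a read that returns multiple frames at once).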
I have written code to load a chunk of the data manually by jumping around the file using fseek. However, then I lose access to the datastore object as well as its built-in functionality. So I thought I would throw this question out there to see if anyone could offer some help.
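For reference, the manual approach looks roughly like the sketch below. This is a simplified illustration rather than my exact code: the function name readRandomFrames, the uint16 pixel type, and the assumption that the file has no header are placeholders.

```matlab
% Sketch only: readRandomFrames is a hypothetical name; uint16 pixels and a
% file with no header are assumptions.
function [frames, idx] = readRandomFrames(filename, frameSize, chunkSize)
    bytesPerFrame = prod(frameSize) * 2;          % 2 bytes per uint16 pixel
    fileInfo  = dir(filename);
    numFrames = floor(fileInfo.bytes / bytesPerFrame);

    % Pick chunkSize distinct frame indices at random.
    idx = randperm(numFrames, chunkSize);

    fid = fopen(filename, 'r');
    assert(fid ~= -1, 'Could not open %s', filename);
    cleanup = onCleanup(@() fclose(fid));         % close the file on exit

    frames = zeros([frameSize, chunkSize], 'uint16');
    for k = 1:chunkSize
        offset = (idx(k) - 1) * bytesPerFrame;    % byte offset of frame idx(k)
        status = fseek(fid, offset, 'bof');
        assert(status == 0, 'fseek failed for frame %d', idx(k));
        frames(:, :, k) = fread(fid, frameSize, 'uint16=>uint16');
    end
end
```

A call such as [frames, idx] = readRandomFrames('stack.bin', [512 512], 100) (file name and sizes made up) returns 100 randomly chosen frames along with their indices, but nothing about it is a datastore.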
datastore, binary, shuffle, image stack, big data, large file, data import
MATLAB Answers - New Questions