The standard was to manually export your data, then write a script to feed the data to your models for training.
Today we are changing that with the all-new:
Diffgram Streaming — Direct to Memory for Pytorch and Tensorflow
This is huge! But before we get ahead of ourselves, let’s provide some context.
The Current State: Manual Exports Only
- Decentralized information: once you download an export, it is completely detached from the system, so you cannot track future changes to the same dataset.
- Problems sharing versions: if John generates an export of Dataset A and Paul then changes some of the files in the dataset, Paul has to notify John about the changes, otherwise John’s export file is outdated.
- High resource usage: with huge amounts of training data it can be almost impossible to hold everything in memory, so developers have to manage memory carefully or just go and pay for bigger machines.
- Transformation scripts needed to feed the data into your favorite AI framework, like PyTorch or TensorFlow.
The list goes on!
Introducing Streaming Training Data — Direct To Memory
Load Training Data on Demand to Pytorch and Tensorflow
Here is the difference between before, with manual exports, and now, with Diffgram data streaming.
Optimize your datasets for ML and say goodbye to boilerplate code. This is the fastest way to get your data for all machine learning tasks, including computer vision. A true Data 2.0 format.
With the newest version of the Diffgram SDK, we have updated our Directory object so that all the files in a dataset can be ingested directly, without generating an export file.
How do we do this?
We’ve made our datasets Python iterables that stream each item on demand to your local machine.
We’ve also implemented methods that transform the data into PyTorch or TensorFlow datasets for easier ingestion into models.
And more! Let’s skip to the examples!
Example Code — Stream ML Training Data
Access a dataset, access an element in it, and connect it to pytorch or tensorflow.
dataset = project.directory.get('my dir')

# Stream the first element
file1 = dataset[0]

# Loop through all files
for file in dataset:
    print(file)

# Transform for usage with your favorite framework
pytorch_dataset = dataset.to_pytorch()
# OR
tf_dataset = dataset.to_tensorflow()
This gives you training data that is ready to be ingested by your TensorFlow or PyTorch models.
Colab Notebook Example
In this notebook you will see a full example of how to stream a big dataset (100k+ images) into your training model without having to load all the data on your local machine. We will use the PyTorch Fast R-CNN network.
On Demand Access
Even better, data is only transferred to your RAM when needed: totally on demand.
So if you have a dataset with 100,000 or even 1M images, you won’t need enough RAM to store all 100,000 images plus annotations at once.
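Conceptually, the on-demand idea looks like the sketch below. This is not Diffgram’s actual internals; the class and `fetch_fn` parameter are illustrative stand-ins for the SDK’s lazy fetching.

```python
# Conceptual sketch (not the real Diffgram implementation): a dataset
# that keeps only lightweight IDs in memory and fetches each item lazily.
class StreamingDataset:
    def __init__(self, file_ids, fetch_fn):
        self.file_ids = file_ids   # small list of IDs, not the actual data
        self.fetch_fn = fetch_fn   # stand-in for an API call downloading one item

    def __len__(self):
        return len(self.file_ids)

    def __getitem__(self, index):
        # Only this one item is transferred when it is requested.
        return self.fetch_fn(self.file_ids[index])

    def __iter__(self):
        for file_id in self.file_ids:
            yield self.fetch_fn(file_id)

# Usage: fetch_fn would normally hit the Diffgram API; here it is stubbed.
dataset = StreamingDataset([1, 2, 3], fetch_fn=lambda i: {"id": i})
first = dataset[0]           # fetches just one item
all_items = list(dataset)    # iterates, fetching items one by one
```

The key point: memory holds one item (or a small cache) at a time, no matter how large the dataset is.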
Diffgram’s SDK handles the complexity of feeding the model the data it asks for during the training loop. That’s right! Including automatically re-using examples to avoid network calls, and more caching goodness.
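The re-use idea can be sketched with a simple bounded cache. This is an illustration of the concept, not Diffgram’s code; `fetch_file` is a hypothetical stand-in for a network download.

```python
from functools import lru_cache

# Sketch of the caching idea: repeated requests for the same file during
# training are served from a bounded in-memory cache instead of triggering
# another network round trip. Old entries are evicted automatically.
@lru_cache(maxsize=1024)
def fetch_file(file_id):
    # Stand-in for downloading an image and its annotations.
    return {"id": file_id, "image": f"bytes-for-{file_id}"}

item_a = fetch_file(7)   # first call: performs the "download"
item_b = fetch_file(7)   # second call: cache hit, no network call
```

Because `maxsize` is bounded, memory stays flat even over many epochs.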
We have put together an example that shows the basic usage of the new SDK features.
Works with Queries
For example, here you can get images with more than 3 cars and at least one pedestrian:
sliced_dataset = dataset.slice(
    'labels.cars > 3 and labels.pedestrian >= 1')
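To make the semantics of that query concrete, here is a sketch of the predicate it expresses, applied to plain dictionaries of per-file label counts. Diffgram evaluates slices server side; the field names below are hypothetical.

```python
# Illustrative only: the filter a slice query like
# 'labels.cars > 3 and labels.pedestrian >= 1' boils down to.
def matches(label_counts):
    return (label_counts.get("cars", 0) > 3
            and label_counts.get("pedestrian", 0) >= 1)

files = [
    {"cars": 5, "pedestrian": 2},   # matches: enough cars and a pedestrian
    {"cars": 2, "pedestrian": 4},   # too few cars
    {"cars": 6, "pedestrian": 0},   # no pedestrians
]
sliced = [f for f in files if matches(f)]
```

Only files satisfying every condition end up in the slice.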
Works with Many Nodes
Training on multiple nodes? Send slices of data to each machine without needing to access it all first. You can load the data from anywhere and run queries at scale.
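One common way to split a dataset across machines is round-robin sharding by file ID, sketched below. This is a generic pattern, not a Diffgram API; each node would then stream only its own shard.

```python
# Hypothetical sketch: give every k-th file ID to a node, so each machine
# streams its shard on demand and never touches the full dataset.
def shard(file_ids, node_rank, world_size):
    # Strided slice: node 0 gets IDs 0, world_size, 2*world_size, ...
    return file_ids[node_rank::world_size]

file_ids = list(range(10))
node0_ids = shard(file_ids, node_rank=0, world_size=2)
node1_ids = shard(file_ids, node_rank=1, world_size=2)
```

Combined with streaming, each node only ever downloads the items in its own slice.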
Automatic Security benefits
Big teams often have different security models for data in different contexts. With Diffgram, you can set role-based access control once and have it propagate through to training. This means a single privacy and security model, instead of the data team first grabbing all the data and then juggling multiple sets of duplicated data.
Advantages of the Streaming Approach to Training Data:
- Instantly start training without any export, and save time on transformation steps too.
- Centralize all the training data in a single place. No more 20 different JSONs of the same dataset: as long as you use the SDK, we guarantee the latest training data is available there.
- Reduce memory usage during training: you can train on huge datasets without worrying about the machine’s resources. Diffgram keeps a cache of the fetched data and discards items that have already been given to the model during the training loop.
- Scale every aspect of your training. Stop using JSON, YAML, XML, or other file formats to feed data to your model. Keep the training data in your cloud, accessible to everyone on the team.
That’s just the tip of the iceberg. There are so many more benefits!
Works with the rest of Diffgram
Want to query your data? Annotate it? Bring the full power of open source Diffgram to your team.
Expanding on the earlier example: you can ingest data into Diffgram with the import wizard, explore the data, create customizable automations, and of course use our best-in-class human-centered workflows and annotation experiences.
This means you can go from ingest, to annotate, to exploring a slice of your data, to training instantly.
Get it Now
It’s easy to get started with open source Diffgram; install it in a few minutes:
See the Colab training data streaming example.
If you already have Diffgram installed, see the update guide, and be sure to run pip install --upgrade diffgram.
We want your thoughts!
This is a very new approach, so we are really curious about what you think about this. Try it out and let us know!
- Do you like this idea?
- Do you want more AI frameworks to be supported?
- Do you still prefer JSON exports? (We still have them though)
Let us know in the comments below!
We will keep improving this streaming data approach throughout 2021 and 2022.