It might be that there is not enough space on our SSD; experiments cache a lot of preprocessed data during the first epoch...
Hi DilapidatedDucks58,
Just making sure: do all 8 workers have different worker IDs? (You should see 8 on the workers page in the UI.)
Also, are they running in docker or venv mode?
If disk space is the issue, check the experiment's monitoring; the free space is logged there in GB.
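If you want a quick cross-check directly on the machine, something like this should work. It is just a sketch using the Python standard library, not anything trains-specific, and the `free_space_gb` helper name is made up for illustration. Point it at whichever path sits on the SSD in question.

```python
import shutil
from pathlib import Path


def free_space_gb(path):
    """Return the free space, in GB, on the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / 1024 ** 3


# Assumption: the home directory lives on the SSD you care about.
print(f"{free_space_gb(Path.home()):.1f} GB free")
```

Running it once before the first epoch and once after should show how much the preprocessing cache actually takes.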
Could you verify that you have 8 subfolders named 'venv.X' in the cache folder ~/.trains?
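A minimal way to check that from Python, assuming the folders sit directly under ~/.trains as described above (the `list_venv_dirs` helper is hypothetical; adjust the path if your agent is configured to use a different cache location):

```python
from pathlib import Path


def list_venv_dirs(cache_dir):
    """Return the sorted names of 'venv.*' subfolders directly under cache_dir."""
    cache_dir = Path(cache_dir).expanduser()
    if not cache_dir.is_dir():
        return []
    # Only directories count; plain files that happen to match are skipped.
    return sorted(p.name for p in cache_dir.glob("venv.*") if p.is_dir())


found = list_venv_dirs("~/.trains")
print(len(found), found)
```

With 8 workers you would expect the count to be 8, one folder per worker.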