Are you sure you added the pytorch channel in clearml.conf?
https://github.com/allegroai/clearml-agent/blob/822984301889327ae1a703ffdc56470ad006a951/docs/clearml.conf#L64
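It should look something like this (a sketch based on the default clearml.conf; adjust the channel list to your environment):
agent {
    package_manager {
        # the "pytorch" channel must be listed for conda to resolve torch packages
        conda_channels: ["pytorch", "conda-forge", "defaults"]
    }
}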
Hi YummyMoth34, they will keep trying to send reports.
I think they try for at least several hours.
What's the error you are getting?
SubstantialElk6
The CA is picked up automatically by urllib; check which OS environment variables you need to configure:
https://stackoverflow.com/questions/27835619/urllib-and-ssl-certificate-verify-failed-error
SSL_CERT_FILE and REQUESTS_CA_BUNDLE
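For example, a minimal sketch in Python, set before anything opens an HTTPS connection (the path is a placeholder for your CA bundle):
import os

# urllib/ssl reads SSL_CERT_FILE, requests reads REQUESTS_CA_BUNDLE
ca_bundle = "/path/to/ca-bundle.crt"  # placeholder path
os.environ["SSL_CERT_FILE"] = ca_bundle
os.environ["REQUESTS_CA_BUNDLE"] = ca_bundle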
Hi GreasyPenguin14
Yes, I think you are right, the series name should be next to the title. Let me check it...
try:
None
docker_install_opencv_libs: true
I have a timeseries dataset with dimensions 1,60,1, where the first dimension is the number of samples and the second one is the timestep
I think it should be --input-size 1 60 if the last dimension is the batch size?
(BTW: this goes directly to Triton configuration, it is the information Triton needs in order to run the model itself)
BeefyCow3 if you are trying to optimize a specific metric (i.e. a scalar on a graph), the template Task should report it with the same title/series combination, which should be easy enough to verify in the UI 🙂
You can either report with Tensorboard or with the Trains Logger, either way will work.
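For example, with the Trains Logger (a minimal sketch; the project/title/series names are placeholders, just keep them identical to what the optimizer is configured to look for):
from trains import Task

task = Task.init(project_name="examples", task_name="template task")
logger = task.get_logger()

for i in range(100):
    # the optimizer matches on this exact title/series combination
    logger.report_scalar(title="validation", series="accuracy",
                         value=0.5 + i * 0.001, iteration=i)  # dummy values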
Hi GrievingTurkey78
Could you provide some more details on your use case, and what's expected?
Hi JitteryCoyote63
If you want to stop the Task, click Abort (Reset will not stop the task or restart it, it will just clear the outputs and let you edit the Task itself).
I think we witnessed something like that due to DataLoader multiprocessing issues, and I think the solution was to add multiprocessing_context='forkserver' to the DataLoader:
https://github.com/allegroai/clearml/issues/207#issuecomment-702422291
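i.e. something like this (a sketch, assuming a standard PyTorch DataLoader; the dataset here is a stand-in):
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 3))  # stand-in for your dataset
# 'forkserver' avoids the fork-related deadlocks seen with worker processes
loader = DataLoader(dataset, batch_size=32, num_workers=4,
                    multiprocessing_context='forkserver')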
Could you verify?
My driver says "CUDA Version: 11.2" (I am not even sure this is correct, since I do not remember installing CUDA on this machine, but idk) and there is no pytorch for 11.2, so maybe it falls back to CPU?
For some reason it detects CUDA 11.1 (I assume this is what you have installed; the driver CUDA version is the highest it will support, not necessarily what you have installed)
Notice that you have to have the Task already started by the master process
the storage configuration appears to have changed quite a bit.
Yes, I think this is part of the cloud-ready effort.
I think you can find the definitions here:
https://artifacthub.io/packages/helm/allegroai/clearml
Hi RipeGoose2
What exactly is being uploaded? Are those the actual model weights or intermediate files?
Oh that is odd... let me check something
Hi DeliciousBluewhale87
You mean per Task? Is it reporting? Is it like the project overview?
Sorry my bad, you are looking for:
None
We should probably have a section on that (i.e. running two agents on the same GPU, then explaining how to use it)
Hi @<1523708901155934208:profile|SubstantialBaldeagle49>
If you report on the same iteration with the same title/series you are essentially overwriting the data (as expected)
Regarding the plotly report size, there are two options:
- Round down the numbers (by default it will store all the digits; usually anything after the fourth is quite useless, and rounding will drastically decrease the plot size)
- Use logger.report_scatter2d, it is more efficient and has a mechanism to subsample extremely large graphs (see the sketch right after this list)
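Something like this (a sketch; project/title/series names and sizes are placeholders):
import numpy as np
from trains import Task

task = Task.init(project_name="examples", task_name="large plot")
logger = task.get_logger()

# rounding to 4 digits drastically shrinks the stored plot
scatter = np.round(np.random.random((100000, 2)), 4)
# report_scatter2d also subsamples extremely large graphs for display
logger.report_scatter2d(title="results", series="run 1",
                        scatter=scatter, iteration=0, mode="markers")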
In theory yes; in practice you will be using the same Docker image for all the services, and they will never interfere with one another. You also have the option to do more sophisticated stuff, like mapping the file-server data for a cleanup service (should be out in a few days :)), so it's a balance. Also remember that, relatively speaking, Docker containers are quite lightweight; this is not like running a VM per service...
For me it sounds like the starting of the service is completed, but I don't really see if the autoscaler is actually running. Also I don't see any output in the console of the autoscaler.
Do notice the autoscaler code itself needs to run somewhere; by default it will be running on your machine or on a remote agent.
As a hack you can try DEFAULT_VERSION
(it's just a flag and should basically do Store)
EDIT: sorry that won't work 😞
I assume now it downloads "more" data as this is running in parallel (and yes I assume that before it deleted the files it did not need)
But actually, at least from a first glance, I do not think it should download it at all...
Could it be that the "run_model_path" is a "complex" object of some sort, and it needs to test the values inside?
Based on the log you have shared:
OSError: [Errno 28] No space left on device
I would increase the storage?
https://github.community/t/github-actions-failing-with-errno-28-no-space-left-on-device/18164/10
https://stackoverflow.com/questions/70175977/multiprocessing-no-space-left-on-device
https://groups.google.com/g/ansible-project/c/4U6MyvyvthQ
I would start by increasing the size of the TMPDIR folder
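If you want to do it from Python itself, a minimal sketch (the path is a placeholder for a volume with more space):
import os
import tempfile

tmp_path = "/mnt/large_disk/tmp"  # placeholder path on a larger volume
os.makedirs(tmp_path, exist_ok=True)
os.environ["TMPDIR"] = tmp_path
tempfile.tempdir = None  # force tempfile to re-read TMPDIR
print(tempfile.gettempdir())  # should now point at the larger volume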
Hi LackadaisicalOtter14
Is it possible to remove this line to stop it from being executed?
Everything is possible 🙂 I think the main question is why it is there (which, to the best of my understanding, is to solve for any CUDA drivers and installed packages, meaning anything that is installed at runtime)
I think we can suppress the error, wdyt?
'echo "ldconfig 2>/dev/null" >> /etc/profile && '