
I think it fails because it tries to install trains twice. Could you remove the trains package and test? I'm also curious how you ended up with both installed?!
Hi DilapidatedDucks58
trains-agent tries to resolve the torch package based on the specific CUDA version inside the docker (or on the host machine if used in virtual-env mode). It seems it fails to find the specific version "torch==1.6.0.dev20200421+cu101"
I assume this version was automatically detected by trains when running manually. If this version came from a private artifactory, you can add it to trains.conf https://github.com/allegroai/trains-agent/blob/master/docs/trains.conf#L...
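If it helps, here is a rough sketch of what that section of trains.conf could look like, assuming the build is hosted on a private PyPI-compatible index (the URL below is a placeholder):

```
agent {
    package_manager {
        # extra PyPI index where the private torch build can be resolved
        extra_index_url: ["https://my.artifactory.example/api/pypi/pypi-local/simple"]
    }
}
```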
TenseOstrich47 every agent instance has its own venv copy. Obviously every new experiment will remove the old venv and create a new one. Make sense?
Hi SweetShells3
Try to do:

import os
import torch.distributed as dist
from clearml import Task

# only initialize ClearML on the rank-0 ("master") process
if dist.get_rank() == 0:
    task = Task.init(...)

This will make sure only the "master" process is logged, or alternatively:

if int(os.environ.get('RANK', 0)) == 0:
    task = Task.init(...)
Hi GrievingTurkey78
I think it is already fixed with 0.17.5, no?
I assume the account name and key refer to the storage account credentials that you can get from Azure Storage Explorer?
correct
I am writing quite a bit of documentation on the topic of pipelines. I am happy to share the article here, once my questions are answered and we can make a pull request for the official documentation out of it.
Amazing, please share once done, I will make sure we merge it into the docs!
Does this mean that within a component or add_function_step I cannot use any code from my current directory's code base, only code from external packages that are imported - unless I add my code with ...
Ohh no I see, yes that makes sense, and I was able to reproduce it, thanks!
It seems the code is trying to access an s3 bucket, could that be the case? PanickyMoth78 any chance you can post the full execution log? (Feel free to DM so it won't end up being public)
TrickySheep9 you mean custom containers in clearml-session for remote development ?
the parent task IDs are what I originally wanted, remember?
ohh I missed it
Agreed, MotionlessCoral18 could you open a feature request on the clearml-agent repo please? (I really do not want this feature to get lost, and I'm with you on the importance, let's make sure we have it configurable from the outside)
How does ClearML select the reference branch? Could it be that ClearML only checks the "origin" branch?
Yes, I think we can quickly fix that, I'm just trying to figure out if there are downsides to running "git ls-remote --get-url" without origin
so when inside the docker, I don't see the git repo and that's why ClearML doesn't see it
Correct ...
I could map the root folder of the repo into the container, but that would mean everything ends up in there
This is the easiest; you can also set it via an environment variable.
An upload of 11 GB took around 20 hours, which cannot be right.
That is very, very slow - 11 GB over 20 hours works out to about 152 KB/s ...
It should print to the console: print(task.get_output_log_web_page())
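For context, a minimal sketch of where that call would sit (project and task names are illustrative):

```python
from clearml import Task

task = Task.init(project_name="examples", task_name="my experiment")

# prints the direct URL of this task's page in the web UI
print(task.get_output_log_web_page())
```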
BoredHedgehog47 you need to configure the clearml k8s glue to spin up pods (instead of statically allocating agents per pod), does that make sense?
I can see the shape is [136, 64, 80, 80]. Is that correct?
Yes, that's correct. As for the name, just try input__0
Notice you also need to convert it to TorchScript
Let me know if there is an issue
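As a rough sketch of the TorchScript step, assuming a standard torch.nn.Module (the toy model below is purely illustrative; only the input shape comes from the discussion above):

```python
import torch
import torch.nn as nn

# toy stand-in for the real model (illustrative only)
model = nn.Sequential(nn.Conv2d(64, 64, kernel_size=3, padding=1)).eval()

# example input matching the shape discussed above: [batch, channels, H, W]
example_input = torch.randn(136, 64, 80, 80)

# trace into TorchScript and save the result for serving
traced = torch.jit.trace(model, example_input)
traced.save("model.pt")
```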
LOL, let me look into it, could it be the calling file was somehow deleted?
BTW: you will be losing the comments
Hi MelancholyBeetle72 , that's a very interesting case. I can totally understand how storing a model and then immediately renaming it breaks the upload. A few questions: is there a way for PyTorch Lightning not to rename the model? I also wonder if this scenario (storing a model and then renaming it) happens a lot. I think the best solution is for Trains to create a copy of the file and upload it in the background. That said, the name will still end with .part. What do you think?
Hi DashingHedgehong5
Is the text the labels on the histogram buckets?
Notice the xlabels argument, is this what you are looking for?
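For reference, a minimal sketch of passing xlabels to report_histogram (title, series, and label values are illustrative):

```python
from clearml import Task, Logger

task = Task.init(project_name="examples", task_name="histogram demo")

# xlabels sets the text shown under each histogram bucket
Logger.current_logger().report_histogram(
    title="requests",
    series="latency",
    values=[12, 7, 3, 1],
    iteration=0,
    xlabels=["p50", "p90", "p99", "p999"],
)
```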
it was uploading fine for most of the day
What do you mean by uploading fine for most of the day? Are you suggesting the upload to GS got stuck? Are you seeing the other metrics (scalars, console logs, etc.)?
Notice that in your execute_remotely() you did not specify a queue to put the current Task into
What it does is stop the currently running code and put the newly created task into the specified queue; if you do not specify a queue, it will just abort it and wait for you to manually enqueue it.
To solve it: task.execute_remotely(queue_name='my_queue')
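Roughly, the flow looks like this (project, task, and queue names are illustrative):

```python
from clearml import Task

task = Task.init(project_name="examples", task_name="remote run")

# stops the local run here and puts the task on 'my_queue';
# an agent listening on that queue picks it up and continues from this point
task.execute_remotely(queue_name='my_queue')

# ... training code below runs only on the agent ...
```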
Is there any way to see datasets uploaded to ClearML Data without downloading them using ClearML Data?
Hi VexedCat68
Currently, when you create datasets with clearml-data it has to repackage your files, i.e. upload them. That said, we have received numerous requests for "registering data", and we are looking into it.
Here is the main technical hurdle we are facing, and I would love to get your perspective:
If the data is not available locally, we cannot calculate the hash of the conten...
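For reference, a minimal sketch of the current flow, where add_files plus upload is the repackaging step mentioned above (project, dataset name, and path are illustrative):

```python
from clearml import Dataset

# create a new dataset version
ds = Dataset.create(dataset_project="examples", dataset_name="my_dataset")

# register local files; upload() repackages and pushes them to the storage target
ds.add_files(path="./data")
ds.upload()
ds.finalize()
```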
Hi TrickySheep9
You should probably check the new https://github.com/allegroai/clearml-server-helm-cloud-ready helm chart
It reflects what is stored by Keras, so if Keras stores the best model, that is what you get. BTW, if you pass output_uri=True it will automatically upload the models
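For example, a minimal sketch (project and task names are illustrative; output_uri can also be a specific bucket URI instead of True):

```python
from clearml import Task

# output_uri=True uploads model checkpoints stored during the run
# to the files server instead of only recording a local path
task = Task.init(
    project_name="examples",
    task_name="keras training",
    output_uri=True,
)
```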