Yes, that sounds like the issue, is the file actually there?
LazyTurkey38, ohh I think you are correct 😞
it should be:
```python
# patch the Task and actually send it for execution
if Task.running_locally():
    # this will verify all auto repo detection and python is done.
    task.close()
    # so that we can edit the task
    task.reset()
    # update the repo
    task.update_task(task_data={'script': {'branch': 'new_branch', 'repository': 'new_repo'}})
    # now to actually enqueue the Task
    Task.enqueue(task, queue_name='default')
```
wdyt?
Can you fix it locally, just to verify?
SmugLizard25 are you saying that with the latest version it does not work?
But this is not a copy, this is a mount; your log showed cp failing.
Hi DiminutiveToad80
Do you have a full log? Can you share the code you are trying to run?
Expected behaviour is that it reads the last iteration correctly. At least that is what the docs state.
This is exactly what should happen, are you saying that for some reason it fails?
Also, how do pipelines compare here?
Pipelines are a type of Task, so like Tasks you can clone and enqueue them, or set them as the target of the trigger.
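For example (a minimal sketch; the pipeline Task ID and queue name here are placeholders):
```python
from clearml import Task

# fetch the pipeline controller Task, clone it, and enqueue the clone
pipeline_task = Task.get_task(task_id="<pipeline_task_id>")
cloned = Task.clone(source_task=pipeline_task, name="pipeline clone")
Task.enqueue(cloned, queue_name="services")
```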
the most flexible solution would be to have some way of triggering the execution of a script in the parent task environment,
This is the exact idea of the TriggerScheduler.
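Something along these lines (a sketch based on clearml.automation.TriggerScheduler; the project/queue names and Task ID are placeholders, so check the arguments against your SDK version):
```python
from clearml.automation import TriggerScheduler

trigger = TriggerScheduler(pooling_frequency_minutes=3.0)
trigger.add_task_trigger(
    name="retrain-on-completion",          # hypothetical trigger name
    trigger_project="examples",            # watch Tasks in this project
    trigger_on_status=["completed"],       # fire when a watched Task completes
    schedule_task_id="<base_task_id>",     # Task (or pipeline) to clone and enqueue
    schedule_queue="default",
)
trigger.start()
```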
What am I missing here?
Yes it should
here is fastai example, just in case 🙂
https://github.com/allegroai/clearml/blob/master/examples/frameworks/fastai/fastai_with_tensorboard_example.py
HighOtter69
By default, if you are continuing an experiment it will start from the last iteration of the previous run. You can reset it with task.set_initial_iteration(0).
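For example (a small sketch, assuming you resume the run via continue_last_task; the project/task names are placeholders):
```python
from clearml import Task

task = Task.init(
    project_name="examples",
    task_name="my_experiment",
    continue_last_task=True,
)
# start counting iterations from 0 instead of the last reported one
task.set_initial_iteration(0)
```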
Working on it as we speak 🙂 Hopefully in the next release (probably next week)
Sadly no 😞
(I mean you could quickly write a reader for TB and report it, but it is not built into the SDK)
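Something like this could work (a sketch assuming the standard tensorboard package; the log dir and project/task names are placeholders):
```python
from clearml import Task
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

task = Task.init(project_name="examples", task_name="tb import")
logger = task.get_logger()

ea = EventAccumulator("path/to/tb/logdir")
ea.Reload()
# replay every TB scalar as a ClearML scalar report
for tag in ea.Tags().get("scalars", []):
    for event in ea.Scalars(tag):
        logger.report_scalar(title=tag, series=tag, value=event.value, iteration=event.step)
```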
Hi PanickyMoth78
Yes, I think you are correct, this looks like GS throttling your connection. You can control the number of concurrent uploads with max_workers=1
https://github.com/allegroai/clearml/blob/cf7361e134554f4effd939ca67e8ecb2345bebff/clearml/datasets/dataset.py#L604
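For example (a sketch; the dataset name/project and file path are placeholders):
```python
from clearml import Dataset

dataset = Dataset.create(dataset_name="my_dataset", dataset_project="examples")
dataset.add_files("data/")
# limit concurrent uploads to avoid GS throttling
dataset.upload(max_workers=1)
dataset.finalize()
```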
Let me know if it works
And you are seeing a bunch of the GS SSL errors?
Would the training script be able to register with the server when there is no public IP?
I guess it's more related to networking inside GCP, but I just wanted to know if anyone has tried it.
Hmm, as long as you can access it from within the LAN, it should work ...
Please update when you try it out 🙂
Hmmm, yes we should definitely add --debug (if you can, please add a GitHub issue so it is not forgotten).
FiercePenguin76 Specifically, are you able to SSH manually to <external_address>:<external_ssh_port>?
Oh, in that case add --remote-gateway <external_ip>
It will connect to the provided address instead of the local one. (You can also add --public-ip, which will automatically resolve the public IP of the server.)
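For example (a sketch; the queue name and IP here are placeholders):
```bash
# connect through the externally visible address instead of the local one
clearml-session --queue default --remote-gateway 203.0.113.10
```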
ngrok to connect to the remote server at the office?
That makes sense, I guess this is the equivalent of using a VPN, from that point onward clearml-session can directly access the remote machine, right?
These are the prerequisites for the docker service installed on the host machine (where the agent is running):
Basically follow: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
https://docs.docker.com/compose/gpu-support/
Hi GiganticTurtle0
ClearML will only list the directly imported packages (not their requirements), meaning in your case it will only list "tf_funcs" (which you imported).
But I do not think there is a package named "tf_funcs" right ?
Correct.
It starts with the initial script (entry point); if it is self-contained (i.e. does not interact with the rest of the repo) it will only analyze that script, otherwise it will analyze the entire repo's code.
I get what you're saying. The only problem is that in the case of AutoLogging, I don't have the model ID for the model being saved.
Task.models['output'] should return all the model objects the autologging created
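For example (a sketch using Task.current_task(); the printed attributes are the standard Model properties):
```python
from clearml import Task

task = Task.current_task()  # or Task.get_task(task_id="<task_id>")
# autologged output models are accessible on the task itself
for model in task.models["output"]:
    print(model.id, model.name, model.url)
```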