My pleasure 🙂
Regarding the first direction, this was just pushed 🙂
https://github.com/allegroai/clearml/commit/597a7ed05e2376ec48604465cf5ebd752cebae9c
Regarding the opposite direction:
That is a good question, I really like the idea of just adding another section named Datasets
SucculentBeetle7 should we do that automatically?
BTW: from the instance name it seems like it is a VM with preinstalled PyTorch. Why don't you enable system site packages, so the venv will inherit all the preinstalled packages? It might also save some space 🙂
DeterminedToad86 see here:
https://github.com/allegroai/clearml-agent/blob/0462af6a3d3ef6f2bc54fd08f0eb88f53a70724c/docs/clearml.conf#L55
Change it in the agent's conf file to:
system_site_packages: true
- Could you explain how I can reproduce the missing jupyter notebook (i.e. the ipykernel_launcher.py)
DeterminedToad86
So based on the log it seems the agent is installing:
torch from https://download.pytorch.org/whl/cu102/torch-1.6.0-cp36-cp36m-linux_x86_64.whl
and torchvision from https://torchvision-build.s3-us-west-2.amazonaws.com/1.6.0/gpu/cuda-11-0/torchvision-0.7.0a0%2B78ed10c-cp36-cp36m-manylinux1_x86_64.whl
See in the log:
Warning, could not locate PyTorch torch==1.6.0 matching CUDA version 110, best candidate 1.7.0
But torchvision is downloaded from the CUDA 11 folder...
I...
Would it suffice to provide the git credentials ...
That should be enough, basically this is where they should be:
https://github.com/allegroai/clearml-agent/blob/0462af6a3d3ef6f2bc54fd08f0eb88f53a70724c/docs/clearml.conf#L18
Hmm, that is odd, it seems it missed the fact that this is a Jupyter notebook.
What's the clearml version you are using?
IdealPanda97 hmmm interesting, what's exactly the scenario here?
Hi @<1541592204353474560:profile|GhastlySeaurchin98>
During our first large hyperparameter run, we have noticed that there are some tasks that get aborted with the following console log:
This looks like the HPO algorithm doing early stopping. Which algo are you using?
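For reference, a minimal sketch of where that early stopping usually comes from, assuming the Optuna optimizer is the one in use (base task id, parameter names and limits are placeholders):
from clearml.automation import HyperParameterOptimizer, UniformIntegerParameterRange
from clearml.automation.optuna import OptimizerOptuna

# placeholder base task and parameter range; the point is where early stopping is configured
optimizer = HyperParameterOptimizer(
    base_task_id='<base_task_id>',
    hyper_parameters=[
        UniformIntegerParameterRange('General/epochs', min_value=5, max_value=50),
    ],
    objective_metric_title='validation',
    objective_metric_series='accuracy',
    objective_metric_sign='max',
    optimizer_class=OptimizerOptuna,  # Optuna can prune (abort) under-performing trials
    max_iteration_per_job=1000,       # trials running past this limit are stopped early
    min_iteration_per_job=100,
    max_number_of_concurrent_tasks=4,
)
optimizer.start()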
Hmm, yes this fits the message, which basically says that it gave up on analyzing the code because it ran out of time. Is the execution very short? Or the repo very large?
LuckyRabbit93 We do!!!
quick update, still trying to reproduce ...
Hi BoredHedgehog47, I'm assuming the nginx on the k8s ingress is refusing the upload to the files server
JuicyFox94 wdyt?
to avoid downgrade to clearml==1.9.1
I will make sure this is solved in clearml==1.9.3 & clearml-session==0.5.0 quickly
What about calling Task.init without the agent?
Nice SubstantialElk6 !
BTW: you can configure your clearml client to store the changes from the latest pushed commit (and not the default, which is the latest local commit)
see store_code_diff_from_remote: in clearml.conf:
https://github.com/allegroai/clearml/blob/9b962bae4b1ccc448e1807e1688fe193454c1da1/docs/clearml.conf#L150
setting max_workers to 1 prevents the error (but, I assume, it may come at the cost of slower sequential uploads).
This seems like a question for GS storage, maybe we should open an issue there, their backend does the rate limiting
My main concern now is that this may happen within a pipeline leading to unreliable data handling.
I'm assuming the pipeline code will have max_workers, but maybe we could have a configuration value so that we can set it across all workers, wdyt?
If
...
Hi PanickyMoth78
Yes, I think you are correct, this looks like GS throttling your connection. You can control the number of concurrent uploads with max_workers=1
https://github.com/allegroai/clearml/blob/cf7361e134554f4effd939ca67e8ecb2345bebff/clearml/datasets/dataset.py#L604
Let me know if it works
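For context, a minimal sketch of where that argument goes, assuming a recent clearml version that exposes max_workers on upload (project, name and path are placeholders):
from clearml import Dataset

# placeholder project/name/path; the relevant part is max_workers=1
ds = Dataset.create(dataset_project='examples', dataset_name='my_dataset')
ds.add_files('/path/to/local/files')
# limit concurrent uploads to avoid GS rate limiting (at the cost of slower, sequential uploads)
ds.upload(max_workers=1)
ds.finalize()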
Hi CrookedAlligator14
Hi, I just started using clearml, and it is amazing!
Thank you! 😍
When I enqueue the task, the venv is setup and starts to install all the packages from the
requirements.txt
file, but at the end I get the following in the console:
Can you try with the latest agent? We improved the support for PyTorch (they now have a proper pypi-compatible repo). Can you see if that solves it?
pip3 install clearml-agent==1.5.0rc0
SubstantialElk6
Hmm, do you have torch in the "installed packages" section of the Task?
(This is what the agent uses to set up the environment inside the docker, running as a pod)
MoodyCentipede68 from your log
clearml-serving-triton | E0620 03:08:27.822945 41 model_repository_manager.cc:1234] failed to load 'test_model_lstm2' version 1: Invalid argument: unexpected inference output 'dense', allowed outputs are: time_distributed
This seems to be the main issue: Triton is failing to load the model.
Does that make sense to you? how did you configure the endpoint model?
SourSwallow36 it is possible.
Assuming you are not logging metrics by the same name, it should work.
Try:
Task.init('examples', 'training', continue_last_task='<previous_task_id_here>')
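A slightly fuller sketch, assuming you report under a title/series that the previous run did not use (task id and names are placeholders):
from clearml import Task

task = Task.init(
    project_name='examples',
    task_name='training',
    continue_last_task='<previous_task_id_here>',
)
# report under a title/series the previous run did not use,
# so the new scalars do not clash with the old ones
task.get_logger().report_scalar(title='continued_run', series='loss', value=0.05, iteration=1)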
Hi @<1539055479878062080:profile|FranticLobster21>
Like this?
https://github.com/allegroai/clearml/blob/4ebe714165cfdacdcc48b8cf6cc5bddb3c15a89f[…]ation/hyper-parameter-optimization/hyper_parameter_optimizer.py
Sorry my bad, you are looking for:
None
I am using importlib and this is probably why everything's weird.
Yes that will explain a lot 🙂
No worries, glad to hear it worked out
BTW: I think an easy fix could be:
if running_remotely():
    pipeline.start()
else:
    pipeline.create_draft()
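In context it could look something like this (the pipeline definition is a placeholder, and Task.running_locally() is used here as the inverse of running_remotely()):
from clearml import Task
from clearml.automation import PipelineController

# placeholder pipeline definition
pipeline = PipelineController(name='my-pipeline', project='examples', version='1.0.0')
pipeline.add_step(name='stage_one', base_task_project='examples', base_task_name='stage one task')

if Task.running_locally():
    # local run: only register the pipeline as a draft
    pipeline.create_draft()
else:
    # remote run (e.g. launched by an agent): actually execute it
    pipeline.start()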
I see.
You can get the offline folder programmatically and then copy the folder content (it's the same as the zip, and you can also pass a folder instead of a zip to the import function):
task.get_offline_mode_folder()
You can also have a soft link of the offline folder (if you are working on a Linux machine):
ln -s myoffline_folder ~/.trains/cache/offline
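For completeness, a small sketch of the offline round trip (paths and names are placeholders; as noted above, the import function also accepts the folder, not just the zip):
from clearml import Task

# run in offline mode: everything is recorded locally instead of being sent to the server
Task.set_offline(offline_mode=True)
task = Task.init(project_name='examples', task_name='offline run')
# ... training code ...
task.close()

# the offline session folder (same content as the generated zip)
offline_folder = task.get_offline_mode_folder()

# later, on a machine that can reach the server, import the zip or the folder itself
Task.import_offline_session(str(offline_folder))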
StraightDog31 how did you get these?
It seems like it is coming from matplotlib, no?
The downstream stages are rankN scripts; they are waiting for the IP address of the first stage.
Is this like multi-node training, rather than a pipeline?