and if you add --skip-task-init ?
I think what happens is that clearml-task adds a Task.init call without the output_uri, which is called before "your" Task.init, and this is what causes it to be ignored. Could that be the case?
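For reference, this is the kind of explicit output_uri call I mean in "your" code (just a sketch, the bucket URL is a placeholder):
from clearml import Task

# your own Task.init call, with the output destination set explicitly
# (the s3 URL below is only a placeholder, replace it with your storage)
task = Task.init(
    project_name="examples",
    task_name="my experiment",
    output_uri="s3://my-bucket/models",  # where models / artifacts are uploaded
)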
WackyRabbit7 my apologies for the lack of background in my answer
Let me start from the top, one of the goals of the trains-agent is to reproduce the "original" execution environment. Once that is done, it will launch the code and monitor it. In order to reproduce the original execution environment, trains-agent will install all the needed python packages, pull the code, and apply the uncommitted changes.
If your entire environment is python based, then virtual-environment mode is proba...
Hi SoreDragonfly16
The warning you mention means that someone changed the state of the experiment to "aborted", which in turn will actually kill the process.
What do you mean by "If I disable the logger," ?
Okay, so the idea behind the new decorator is not to group all the defined steps under the same script so that they share the same environment, but rather to simplify the process of creating scripts for each step and avoid manually calling Task.init on those scripts.
Correct, and allow users to more easily create Tasks from code.
Regarding virtual environment creation from caching, I will keep running benchmarks (from what you say it might be due to high workload ...
No worries, just wanted to make sure it doesn't slip away
Hi @<1578555761724755968:profile|GrievingKoala83>
mount s3 as a cache folder
I'm not sure that would be fast enough for cache ...
How to override the /root/.cache/pip path?
In your clearml.conf file, override the pip cache folder option, then point it to your PV.
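Something along these lines (a sketch; docker_pip_cache is the option I have in mind, and the mount path is a placeholder):
# clearml.conf (agent section) - sketch, the path is a placeholder for your PV mount
agent {
    # host folder the agent maps into the container as /root/.cache/pip
    docker_pip_cache: "/mnt/my-pv/pip-cache"
}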
ClumsyElephant70 the odd thing is the error here:
docker: Error response from daemon: manifest for nvidia/cuda:latest not found: manifest unknown: manifest unknown.
I would imagine it will be with "nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.04" but the error is saying "nvidia/cuda:latest"
How could that be ?
Also can you manually run the same command (i.e. docker run --gpus device=0 --rm -it nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.04 bash)?
Hi ContemplativeCockroach39
Seems like you are running the exact code as in the git repo:
Basically it points you to the exact repository https://github.com/allegroai/clearml and the script examples/reporting/pandas_reporting.py
Specifically:
https://github.com/allegroai/clearml/blob/34c41cfc8c3419e06cd4ac954e4b23034667c4d9/examples/reporting/pandas_reporting.py
DeliciousSeal67 the agent will use the "installed packages" section in order to install packages for the code. If you clear the entire section (you can do that in the UI or programmatically) then it will revert to requirements.txt
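If you want to do it from code, a rough sketch (assuming Task.set_packages accepts an empty list to clear the section; the task id is a placeholder):
from clearml import Task

# sketch: clear the "installed packages" section of an existing task so the
# agent falls back to requirements.txt (assumes an empty list clears the section)
task = Task.get_task(task_id="<your-task-id>")
task.set_packages([])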
Make sense ?
VexedCat68 actually a few users already suggested we auto log the dataset ID used as an additional configuration section, wdyt?
I think your use case is the original idea behind "use_current_task" option, it was basically designed to connect code that creates the Dataset together with the dataset itself.
I think the only caveat in the current implementation is that it should "move" the current Task into the dataset project / set the name. wdyt?
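Something along these lines is what I have in mind (just a sketch, project / file names are placeholders):
from clearml import Task, Dataset

# sketch: create the dataset from the same code that generates it, so the
# dataset reuses the current Task instead of spawning a new one
task = Task.init(project_name="data", task_name="build dataset")
dataset = Dataset.create(
    dataset_name="my dataset",
    dataset_project="data",
    use_current_task=True,  # connect the dataset to this Task
)
dataset.add_files("./data_folder")
dataset.upload()
dataset.finalize()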
try to break it into parts and understand what produces the error
for example:
increase(test12_model_custom:Glucose_bucket[1m])
increase(test12_model_custom:Glucose_sum[1m])
increase(test12_model_custom:Glucose_bucket[1m])/increase(test12_model_custom:Glucose_sum[1m])
and so on
The problem is of course filling in all the configuration details, so that they are viewable.
Other than that, check out:
https://allegro.ai/docs/task.html#trains.task.Task.export_task
https://allegro.ai/docs/task.html#trains.task.Task.import_task
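Rough idea of the flow (just a sketch, the task id and the overridden field are placeholders):
from clearml import Task

# export an existing task definition as a dict, tweak it, then import it back
source_task = Task.get_task(task_id="<source-task-id>")
task_data = source_task.export_task()

# fill in / override whatever configuration details you need to be viewable
task_data["name"] = "imported copy"

# create a new task in the system from the exported definition
new_task = Task.import_task(task_data)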
Sounds good ?
Hmm can you run the agent in debug mode, and check the specific console log?
clearml-agent --debug daemon --foreground ...
But it should work out of the box ...
Yes it should ....
The user and personal access token are used as-is, and they propagate down to submodules, since those are simply another git repository.
Can you manually run the following successfully:
git clone --recursive https://user:token@github.com/company/repo_with_submodules
And are you seeing a bunch of the GS SSL errors?
Hi SkinnyPanda43
Let's say that I install the shared libs with pip in editable mode on my development environment, how will the clearml-agent handle those libraries if I submit a job
So installing packages from local folders with "-e" is in general ill-advised.
But using a full git path should work out of the box. For example, if you run pip install git+https://github.com/user/repo/repo.git then the agent will be able to install it on the remote machine. The main challenge...
but why is it mounted only once?
Are you saying the second time this line is missing? this is very strange...
Can you send the full Task log?
Why can we even change the pip version in the clearml.conf?
LOL mistakes learned the hard way
Basically too many times in the past pip versions were a bit broken, which is fine if they are used manually and users can reinstall a different version, but horrible when you have an automated process like the agent, so we added a "freeze version" option, only with greater control. Make sense ?
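For reference, this is the conf option I'm referring to (a sketch; adjust the version spec to whatever you need):
# clearml.conf - pin / freeze the pip version the agent uses
agent {
    package_manager {
        # e.g. keep pip below a release known to be broken
        pip_version: "<20.2"
    }
}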
Then we can figure out what can be changed so CML correctly registers process failures with Hydra
JumpyPig73 quick question, does the state of the Task change immediately when it crashes? Are you running it with an agent (that hydra triggers)?
If this is vanilla clearml with Hydra runners, what I suspect happens is that Hydra is overriding the signal callback clearml adds (like hydra, clearml needs to figure out if the process crashed), then what happens is that clearml's callback is never cal...
Thanks JumpyPig73
Yeah this would explain it ... (if hydra is setting something else we can tap into that as well)
Was trying to figure out how the method knows that the docker image ID belongs to ECR. Do you have any insight into that?
Basically you should have the docker service login before running the agent, then the agent uses docker to run the image from the ECR.
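For example (standard AWS CLI v2 login; the account id and region are placeholders), run this before starting the agent:
# log docker into ECR first (account id / region are placeholders)
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

# then start the agent in docker mode as usual
clearml-agent daemon --queue default --docker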
Make sense ?
Hi TenseOstrich47, what's the matplotlib version and clearml version you are using?
Hi HealthyStarfish45
Funny just today I had a similar discussion on slurm:
https://allegroai-trains.slack.com/archives/CTK20V944/p1603794531453000
Anyhow, when you say "[scale up agents]" are you referring to a machine constantly running an agent pulling jobs from the queue, where the machine itself (aka the resource) is managed as a slurm job?
Thanks a lot. I meant running a bash script after cloning the repository and setting the environment
Hmm that is currently not supported
The main issue in adding support is where to store this bash script...
Perhaps somewhere inside clear ml there is an order of actions for starting that can be changed?
Not that I can think of,
but let's assume you could have such a thing, what would you have put in the bash script (basically I want to see maybe there is a worka...
(without having to execute it first on Machine C)
Someone somewhere has to create the definition of the environment...
The easiest way to go about it is to execute it once.
You can add the following line to your code:
task.execute_remotely(queue_name='default')
This will cause your code to stop running and enqueue itself on a specific queue.
Quite useful if you want to make sure everything works, (like run a single step) then continue on another machine.
Notice that switching between cpu...
hmm... try to run the trains-agent from the ml environment with "system_site_packages: true", it might do the trick (see the sketch below). Anyhow please let me know if it worked
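i.e. in the agent's conf file, something like (a sketch):
# trains.conf / clearml.conf - let the agent's virtualenv see the conda env's packages
agent {
    package_manager {
        system_site_packages: true
    }
}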