PompousBeetle71 could you check that the "output:destination" is the same for both experiments?
The issue is progress reporting for HTTP uploads (object storage uploads do report progress). Basically the HTTP upload is a POST done with urllib, which does not support upload callbacks for progress reporting. If you have an idea here, we will gladly add it (as you mentioned, it can be quite annoying to have to open a network manager just to verify the upload is progressing).
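As a rough sketch of one possible approach (this is not ClearML code, and all names below are made up): wrap the file object so that every read() issued by urllib while streaming the POST body reports progress.
```python
# Hedged sketch, not the actual ClearML implementation: urllib accepts a
# file-like object as the request body, so a wrapper that counts the bytes
# handed out by read() can drive a progress callback during the upload.
import os
import urllib.request


class ProgressFileReader:
    """File-like wrapper that invokes a callback as chunks are read for upload."""

    def __init__(self, path, callback):
        self._f = open(path, "rb")
        self._total = os.path.getsize(path)
        self._sent = 0
        self._callback = callback

    def read(self, size=-1):
        data = self._f.read(size)
        self._sent += len(data)
        self._callback(self._sent, self._total)
        return data

    def close(self):
        self._f.close()


def upload_with_progress(url, path):
    # Content-Length must be set explicitly when streaming a file-like body
    reader = ProgressFileReader(
        path, lambda sent, total: print(f"uploaded {sent}/{total} bytes")
    )
    req = urllib.request.Request(
        url, data=reader, headers={"Content-Length": str(os.path.getsize(path))}
    )
    try:
        return urllib.request.urlopen(req)
    finally:
        reader.close()
```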
GrievingTurkey78 I see,
Basically the arguments after the -m src.train in the remote execution should be ignored (they are not needed).
Change the m entry in the Args section under the configuration. Let me know if that solves it.
Just making sure I understand: you want to upload your models with clearml to the Yandex-compatible S3 storage?
HurtWoodpecker30
The agent uses the requirements.txt
What do you mean by that? Aren't the packages listed in the "Installed packages" section of the Task?
(Or is it empty when starting, i.e. it uses the requirements.txt from the github repo, and then the agent lists them back into the Task?)
Oh my bad, post 0.17.5 😞
RC will be out soon; in the meantime you can install directly from github:
pip install git+
ProudMosquito87 Just a few pointers on how we convert the TB histograms to awesome (but less accurate) 3D surfaces.
First I have to admit, I almost never use these histograms, maybe to detect a plateau or if something goes really wrong...
The 3D surface is basically grouping all the histograms and then bucketing them (I think the default is 50 buckets) so that you get a general feel for what's going on, not necessarily a detailed view. Bottom line, you are correct, the TB is the source of truth...
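To illustrate the general idea only (this is not the actual ClearML code, and the bucket count is just the assumed default mentioned above), re-bucketing a series of reported histograms into a fixed grid is roughly:
```python
# Rough sketch of grouping + bucketing histograms into a surface
# (iterations x buckets), purely to illustrate the idea described above.
import numpy as np

NUM_BUCKETS = 50  # assumed default


def histograms_to_surface(histograms):
    """histograms: list of (bin_edges, counts) tuples, one per reported iteration."""
    lo = min(edges[0] for edges, _ in histograms)
    hi = max(edges[-1] for edges, _ in histograms)
    buckets = np.linspace(lo, hi, NUM_BUCKETS + 1)
    rows = []
    for edges, counts in histograms:
        centers = (np.asarray(edges[:-1]) + np.asarray(edges[1:])) / 2.0
        # accumulate each original bin's count into the coarse bucket it falls in
        row, _ = np.histogram(centers, bins=buckets, weights=counts)
        rows.append(row)
    return np.vstack(rows)  # render this matrix as the 3D surface
```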
Sure thing, and I agree it seems unlikely to be an issue 🙂
Ohh sorry, you will also need to fix the _patched_task_function definition.
The parameter order is important as the partial call relies on it.
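As a generic illustration of why the order matters (the function and values here are made up, not the actual ClearML internals):
```python
# functools.partial binds values positionally, so reordering the wrapped
# function's parameters silently shifts which parameter receives which value.
from functools import partial


def patched_task_function(task_id, queue_name, *args, **kwargs):
    print(f"task={task_id} queue={queue_name} args={args} kwargs={kwargs}")


# "abc123" -> task_id and "services" -> queue_name purely by position
bound = partial(patched_task_function, "abc123", "services")
bound(1, 2, flag=True)

# If the signature were (queue_name, task_id, ...), the exact same partial()
# call would swap the two values without raising any error.
```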
Hi HealthyStarfish45
Funny just today I had a similar discussion on slurm:
https://allegroai-trains.slack.com/archives/CTK20V944/p1603794531453000
Anyhow, when you say "[scale up agents]" are you referring to a machine constantly running an agent pulling jobs from the queue, where the machine itself (aka the resource) is managed as a slurm job?
Hi JumpyDragonfly13
Let's assume we have two machines, one we call remote, one we call laptop (at least for this discussion)
On the Remote machine we need to run (notice we must have docker preinstalled on the remote machine; it can work without docker, let me know if that is the case for you):
clearml-agent daemon --queue interactive --create-queue --docker
On the Laptop we run:
clearml-session --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
What clearml-session will do is crea...
Curious what advantage it would be to use the StorageManager
Basically if you set the clearml cache folder to the EFS, users can always do:
from clearml import StorageManager
local_file = StorageManager.get_local_copy("...")
where local_file is stored on the persistent cache (EFS) and the cache is automatically cleaned based on the last accessed file
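For reference, a hedged example of pointing the cache at an EFS mount via clearml.conf (the mount path is an assumption, and please double check the key name against your own clearml.conf):
```
sdk {
    storage {
        cache {
            # assumed EFS mount point; keep the default if unsure
            default_base_dir: "/mnt/efs/clearml-cache"
        }
    }
}
```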
GrievingTurkey78 notice that when enqueuing an aborted Task, the agent will not delete the previously reported metrics/logs
in which I can just spawn an ad-hoc worker
Can you elaborate on what you would do with it? Like an OS environment variable that disables the entire setup itself? Will it clone the code base?
GrievingTurkey78
maybe since the package is not directly imported in my code it is possible to get a different version to what I have locally (?).
If these are derivative packages (i.e. imported by other packages) they are not automatically logged when executing the Task manually (in order to keep the "installed packages" as lean as possible on the one hand, but also specify the important packages for you)
That said, when the "trains-agent" executes the task it will store back...
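If it helps, and assuming I'm remembering the API correctly, you can also pin a derivative package explicitly so it is always recorded (the package name and version below are placeholders):
```python
from clearml import Task

# must be called before Task.init(); the name/version here are placeholders
Task.add_requirements("some_indirect_package", "1.2.3")
task = Task.init(project_name="examples", task_name="pin indirect requirement")
```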
I could take a look and figure that out.
This will greatly accelerate integration 😉
but when we try to do a "New Run" from UI, it tries to follow the DAG of previous run (the run with all child nodes skipped) and the new run fails too.
This is odd, is this reproducible? What's the clearml python package version?
Oh I get it now, can you test:
git ls-remote --get-url github
and then:
git ls-remote --get-url
AttractiveCockroach17 can you provide some insight on the pipeline creation?
Is this repo installed on the machine creating the pipeline?
You can also manually add it here: `packages=["link_to_internal_python_package", ]`
GrievingTurkey78 can you send the entire log?
I will take any suggestion 🙂 `git remote -v` could be a good start, but I'm not familiar with the output structure; is there a template for parsing?
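For what it's worth, a quick parsing sketch, assuming the usual `<name>\t<url> (fetch|push)` line format of `git remote -v`:
```python
import subprocess


def get_remotes():
    # each line looks like: "origin\tgit@github.com:user/repo.git (fetch)"
    out = subprocess.check_output(["git", "remote", "-v"], text=True)
    remotes = {}
    for line in out.splitlines():
        name, rest = line.split("\t", 1)
        url, direction = rest.rsplit(" ", 1)
        remotes.setdefault(name, {})[direction.strip("()")] = url
    return remotes


print(get_remotes())
# e.g. {'origin': {'fetch': 'git@github.com:user/repo.git',
#                  'push': 'git@github.com:user/repo.git'}}
```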
(BTW: you can disable the auto-logging feature of joblib with `Task.init(..., auto_connect_frameworks={'scikit': False})`)
based on this one:
https://stackoverflow.com/questions/31436407/git-ls-remote-returns-fatal-no-remote-configured-to-list-refs-from
I think this is a specific issue of the local git repo configuration, can you verify?
(btw: I tested with git 2.17.1, and `git ls-remote --get-url` will return the remote url without an error)
You can try calling `task._update_repository()`
I'm still trying to figure out how to reproduce it...