BTW: if you are running the local code with conda, you can set the agent to use conda as well (notice that if you are running locally with pip, the agent's conda env will use pip to install the packages to avoid version mismatch)
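For reference, a minimal clearml.conf snippet selecting conda for the agent (key names per the reference clearml.conf):

```
agent {
    package_manager {
        # create the task environment with conda instead of pip
        type: conda
    }
}
```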
Okay, I think I know what's going on (there is a race that for some reason behaves differently on Colab).
As a quick hack you can do the following:
    Task._report_subprocess_enabled = False
    task = Task.init(...)
    task.set_initial_iteration(0)
But I do not know how it can help me :(
In your code itself, after the Task.init call, add:
    task.set_initial_iteration(0)
See reply here:
https://github.com/allegroai/clearml/issues/496#issuecomment-980037382
I can't think of any actual difference in flow ...
Can you try the following?
    task._setup_reporter()
    task.set_initial_iteration(0)
Hmm, yes, this fits the message, which basically says that it gave up on analyzing the code because it ran out of time. Is the execution very short? Or the repo very large?
@<1585078763312386048:profile|ArrogantButterfly10> could it be that in the "base task" of the pipeline step, you do not have any hyper-parameter ? (I mean the Task that the pipeline clones and is supposed to set new hyperparameters for...)
Hi RipeGoose2
Are you continuing the Task, i.e. passing Task.init(..., continue_last_task=True) ?
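For reference, continuing a task looks roughly like this (project/task names are placeholders):

```python
from clearml import Task

# resume reporting into the previous task instead of creating a new one
task = Task.init(
    project_name="examples",
    task_name="my experiment",
    continue_last_task=True,
)
# reset the iteration counter so reports do not continue from the old offset
task.set_initial_iteration(0)
```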
Yes, that's the reason. Basically there is a background thread analyzing the code; at the end of the execution, if it is still running (hence the question regarding execution time), we give it an extra 10 seconds to come up with answers, otherwise we terminate it so the code won't get stuck. Makes sense to you?
SmarmyDolphin68 sadly if this was not executed with trains (i.e. the offline option of trains), this is not really doable (I mean it is, if you write some code and parse the TB 🙂 but let's assume this is way too much work)
A few options:
- On the next run, use the clearml offline option (i.e. in your code call Task.set_offline() , or set the env variable CLEARML_OFFLINE_MODE=1 ).
- You can compress and upload the checkpoint folder manually, by passing the checkpoint folder, see https://github.com...
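A minimal sketch of the offline flow (project/task names are placeholders):

```python
from clearml import Task

# enable offline mode before Task.init; everything is recorded locally
Task.set_offline(offline_mode=True)

task = Task.init(project_name="examples", task_name="offline run")
# ... training / checkpointing code ...

# at the end a zip of the session is written locally; it can later be
# imported into the server with Task.import_offline_session("<session zip>")
```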
Is this reproducible with the hpo example here:
https://github.com/allegroai/clearml/tree/400c6ec103d9f2193694c54d7491bb1a74bbe8e8/examples/optimization/hyper-parameter-optimization
What's your clearml version? (And is it possible you verify with the latest version?)
Anyway, in the docs, there is a function called task.register_artifact()
Yes, this is rather deprecated... The idea is that it will monitor an object and auto-sync it (i.e. serialize and upload it).
That said, it is just so much easier to do task.upload_artifact
and you can always update/overwrite by passing the same name, so I cannot see the actual use case. Does that make sense? What are you using it for ?
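For reference, a minimal sketch of the upload_artifact flow (project/task names and the dict payload are just examples):

```python
from clearml import Task

task = Task.init(project_name="examples", task_name="artifacts demo")

# upload once...
task.upload_artifact(name="summary", artifact_object={"epoch": 1, "acc": 0.71})
# ...and overwrite later simply by re-uploading under the same name
task.upload_artifact(name="summary", artifact_object={"epoch": 2, "acc": 0.78})
```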
shared "warm" folder without having to download the dataset locally.
This is already supported 🙂
Configure sdk.storage.cache.default_base_dir in your clearml.conf to point to a shared (mounted) folder:
https://github.com/allegroai/clearml-agent/blob/21c4857795e6392a848b296ceb5480aca5f98e4b/docs/clearml.conf#L205
That's it 🙂
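For example (the path is a placeholder for your shared mount):

```
sdk {
    storage {
        cache {
            # shared (mounted) folder used as the download cache
            default_base_dir: "/mnt/shared/clearml-cache"
        }
    }
}
```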
ok, yes, but this will install the package of the branch specified there.
Correct
So if I'm working on my own branch and want to run an experiment, I would have to manually put my current branch name in the git path.
When you say your own branch you mean local (i.e. not pushed to remote git repo) ?
Is Task.current_task() creating a task?
Hmm it should not, it should return a Task instance if one was already created.
That said, I remember there was a bug (not sure if it was in a released version or an RC) that caused it to create a new Task if there isn't an existing one. Could that be the case ?
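To illustrate the expected behavior (names are placeholders):

```python
from clearml import Task

# should return the Task created earlier in this process by Task.init(),
# or None if no task exists yet -- it should not create a new one
task = Task.current_task()
if task is None:
    task = Task.init(project_name="examples", task_name="my run")
```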
ZanyPig66 is this reproducible? This sounds like a bug. What's the TB version and OS you are using?
Is this example working for you (i.e. do you see debug images)?
https://github.com/allegroai/clearml/blob/master/examples/frameworks/pytorch/pytorch_tensorboard.py
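If the linked example is not handy, a minimal repro along the same lines (assumes torch and tensorboard are installed; names are placeholders):

```python
import numpy as np
from torch.utils.tensorboard import SummaryWriter
from clearml import Task

task = Task.init(project_name="examples", task_name="tb debug images")
writer = SummaryWriter()
# a random (C, H, W) image; it should appear under the task's "Debug Samples"
writer.add_image("random_image", np.random.rand(3, 64, 64), global_step=0)
writer.close()
```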
SmilingFrog76
there is no internal scheduler in Trains
So obviously there is a scheduler built into Trains, this is the queues (order / priority)
What is missing from it is multi node connection, e.g. I need two agents running the exact same job working together.
(as opposed to, I have two jobs, execute them separately when a resource is available)
Actually my suggestion was to add a SLURM integration, like we did with k8s (I'm not suggesting Kubernetes as a solution for you, the op...
Hi OutrageousGrasshopper93
When the Task is executed on a worker, the presence of spaces breaks the URLs, and from the UI I cannot access the resources on the bucket
You are saying the URLs generated in a remote execution are "broken" and on local execution are working, even though it is the same project/task name ?
You can already sort and filter experiments based on any hyperparameter or metric that the experiment reports; there is no need for any custom query language. Also, any filtered/sorted table can be shared exactly as it is, so you can create leaderboards and share specific filters. You can also use the search bar to filter based on experiment name / comment. Tags will be added soon as well 🙂
Example of custom columns is here (the screen grab is a bit old, now there is als...
model_path/run_2022_07_20T22_11_15.209_0.zip , err: [Errno 28] No space left on device
Where was it running?
I take it that these files are also brought onto the pipeline task's local disk?
Unless you changed the object, then no, they should not be downloaded (the "link" is passed)
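In other words, something along these lines (the task id and artifact name are placeholders; url / get_local_copy are the standard Artifact accessors):

```python
from clearml import Task

producer = Task.get_task(task_id="<id of the step that created the artifact>")
artifact = producer.artifacts["dataset"]

print(artifact.url)                      # just the remote link, nothing downloaded
local_path = artifact.get_local_copy()   # only this call actually fetches the file
```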
Hi PerplexedGoat65
it appears, in a practical sense, this means to mount the second drive, and then bind them in ClearML's configuration
Yes, the entire data folder (reason is, if you lose it, you lose all the server storage / artifacts)
Also, thinking about Docker and slower access speed for Docker mounts and such,
If the host OS is linux, you have nothing to worry about, speed will be the same.
ReassuredTiger98 when you look for task "dca2e3ded7fc4c28b342f912395ab9bc" there are no artifacts ?
Could you add some prints? This should have worked...
Hi @<1523702307240284160:profile|TeenyBeetle18>
and the url of the model refers to a local file, not to the remote storage.
Do you mean that in the Model tab when you look into the model details the URL points to a local location (e.g. file:///mnt/something/model) ?
And your goal is to get a copy of that model (file) from your code, is that correct ?
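Roughly, getting that copy from code would look like this (the model id is a placeholder):

```python
from clearml import InputModel

model = InputModel(model_id="<model id from the UI>")
print(model.url)                      # the registered URL (here a file:// one)
local_path = model.get_local_copy()   # downloads (and caches) the weights file
```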
The easiest would be as an artifact (I think).
Let's assume you put it into a csv file (with pandas or manually)
To upload (from the pipeline Task itself):
    task.upload_artifact(name='summary', artifact_object='~/my/summary.csv')
Then if you want to grab it from anywhere else:
    task = Task.get_task(task_id='HPO controller Task id here')
    my_csv = task.artifacts['summary'].get_local_copy()
If you want to store it as a dict it might be even easier:
    task.upload_artifact(name='summary', artifa...