
I think this is the main issue. Is this reproducible? How can we test that?
Okay, I'll make sure we change the default image to the runtime flavor of nvidia/cuda
Hmm, so Task.init should be called in the main process; this way the subprocess knows the Task is already created (you can call Task.init twice to get the task object). I wonder if we can somehow communicate between the subprocesses without initializing in the main one...
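For example, a minimal sketch of that pattern (the project/task names here are just placeholders):

from multiprocessing import Process
from clearml import Task

def worker():
    # Calling Task.init again (or Task.current_task()) inside the subprocess
    # returns the Task that was already created in the main process
    task = Task.init(project_name="examples", task_name="subprocess logging")
    task.get_logger().report_text("reporting from a subprocess")

if __name__ == "__main__":
    # Create the Task once in the main process, before spawning subprocesses
    Task.init(project_name="examples", task_name="subprocess logging")
    p = Process(target=worker)
    p.start()
    p.join()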
Is this reproducible with the hpo example here:
https://github.com/allegroai/clearml/tree/400c6ec103d9f2193694c54d7491bb1a74bbe8e8/examples/optimization/hyper-parameter-optimization
What's your clearml version? (And could you verify with the latest version?)
Hi @<1543766544847212544:profile|SorePelican79>
You want the pipeline configuration itself, not the pipeline component, correct?
from clearml import Task

# Get the pipeline controller Task from inside the component Task
pipeline = Task.get_task(Task.current_task().parent)
# Fetch the configuration either as raw text or as a dict
conf_text = pipeline.get_configuration_object(name="config name")
conf_dict = pipeline.get_configuration_object_as_dict(name="config name")
so that you can get the latest artifacts of that experiment
What do you mean by "the latest artifacts"? Do you have multiple artifacts on the same Task, or is it the latest Task holding a specific artifact?
Hi @<1542316991337992192:profile|AverageMoth57>
is this a follow-up to this thread? None
Hi GreasyPenguin14
However the cleanup service is also running in a docker container. How is it possible that the cleanup service has access and can remove these model checkpoints?
The easiest solution is to launch the cleanup script with a mount point from the storage directory to inside the container (-v <host_folder>:<container_folder>).
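For example (the paths and image name here are placeholders, assuming the fileserver data lives under /opt/clearml/data/fileserver on the host):
docker run -v /opt/clearml/data/fileserver:/opt/clearml/data/fileserver <cleanup-service-image> ...
Mounting to the same path inside the container means the file paths stored on the server resolve correctly when the cleanup script deletes them.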
The other option, supported by clearml version 1.0 and above, is using Task.delete, which now supports deleting the artifacts and mod...
Yep, basically this will query the Task and get the last one:
https://github.com/allegroai/clearml/blob/ca70f0a6f6d52054a095672dc087390fabf2870d/clearml/task.py#L729
Notice that task_filter allows you to do all sorts of filtering:
https://github.com/allegroai/clearml/blob/ca70f0a6f6d52054a095672dc087390fabf2870d/clearml/task.py#L781
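For example, a rough sketch (project/task/artifact names are placeholders, and the filter keys are what I'd expect the backend query to accept):

from clearml import Task

# Query the project and take the most recently updated completed Task
tasks = Task.get_tasks(
    project_name="examples",
    task_name="my experiment",  # placeholder, supports partial matching
    task_filter={"status": ["completed"], "order_by": ["-last_update"]},
)
if tasks:
    latest = tasks[0]
    # "my_artifact" is a placeholder artifact name
    local_copy = latest.artifacts["my_artifact"].get_local_copy()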
Hi CluelessElephant89
hey guys, I believe clearml-agent-services isn't necessary, right?
Generally speaking, yes, you are correct 🙂
Specifically, this is the "services" queue agent, running your pipeline logic, services etc.
But it is not a must for the server to work, and you can also spin it up on a different host.
Of course, I used "localhost"
Do not use "localhost" use your IP then it would be registered with a URL that points to the IP and then it will work
None
No, they are not; they take the VS Code backend and put it behind a webserver-ish layer.
Hmmm are you saying the Dataset Tasks do not have the "dataset" system_tag as well as the type ?
FranticCormorant35 DeterminedCrab71 please continue the discussion in this thread
JitteryCoyote63 I think this one:
https://github.com/allegroai/clearml/blob/master/examples/services/cleanup/cleanup_service.py
RobustGoldfish9
I think you need to set the trains-agent docker to be aware of the host, so it knows how to mount data/cache/configurations into the sibling docker
It should look something like:
TRAINS_AGENT_DOCKER_HOST_MOUNT="/mnt/host/data:/root/.trains"
So if running a docker:
docker run -e TRAINS_AGENT_DOCKER_HOST_MOUNT="/mnt/host/data:/root/.trains" ...
function and just seem to be getting an "isadirectory" error?
Can you post here what you are getting? Which clearml version are you using?
also tried manually adding leap==0.4.1 in the task UI which didn't work.
That has to work. If it did not, can you send the log of the failed Task (or the Task that did not install it)?
The environment in the logs does show that leap is being installed potentially from a cache?
- leap @ file:///opt/keras-hannd...
Hmm, maybe we should add a check once the download is done, comparing the expected file size with the actual file size, and if they differ, re-download?
It might be that the file upload was broken?
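Something along these lines could work as the check (a rough sketch, not the actual client code; it assumes the server reports Content-Length):

import os
import requests

def download_with_size_check(url, local_path, retries=3):
    # Ask the server for the expected size up front
    expected = int(requests.head(url, allow_redirects=True).headers.get("Content-Length", 0))
    for _ in range(retries):
        with requests.get(url, stream=True) as r:
            r.raise_for_status()
            with open(local_path, "wb") as f:
                for chunk in r.iter_content(chunk_size=1 << 20):
                    f.write(chunk)
        # Compare expected and actual sizes; if they differ, try the download again
        if not expected or os.path.getsize(local_path) == expected:
            return local_path
    raise RuntimeError("downloaded size keeps mismatching the expected size: " + url)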
Hmm BitterStarfish58 what's the error you are getting ?
Any chance you are over the free tier quota ?
So are you saying the large file download is the issue? (i.e. network issues)
JitteryCoyote63, just making sure, does refreshing fix the issue?
DefeatedCrab47 no idea, but you are more than welcome to join the thread here and point it out:
https://github.com/PyTorchLightning/pytorch-lightning-bolts/issues/249
Yes, that seems to be the case. That said, they should have different worker IDs, agent-0 and agent-1...
What's your trains-agent version ?
Interesting!
Wouldn't Dataset (class) be a good solution ?
Task.init should be called before the pytorch distribution is launched; then on each instance you need to call Task.current_task() to get the Task instance (and make sure the logs are tracked).
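A minimal sketch of that ordering with torch.multiprocessing (project/task names are placeholders; the distributed setup itself is omitted):

import torch.multiprocessing as mp
from clearml import Task

def train(rank, world_size):
    # Each spawned worker picks up the Task created in the main process
    task = Task.current_task()
    task.get_logger().report_text(f"worker {rank}/{world_size} started")
    # ... init torch.distributed and run the training loop here ...

if __name__ == "__main__":
    world_size = 2
    # Task.init must run in the main process, before the workers are spawned
    Task.init(project_name="examples", task_name="distributed training")
    mp.spawn(train, args=(world_size,), nprocs=world_size)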
You can try a direct API call for all the Tasks together:
Task._query_tasks(task_ids=[IDS here], only_fields=['last_metrics'])
Hi @<1545216070686609408:profile|EnthusiasticCow4>
Oh dear, I think this argument is not exposed 😞
- You can open a GH issue
- If you want to add a PR, this is very simple: None
    include_archived=False,
):
    if not include_archived:
        system_tags = ["__$all", cls.__tag, "__$not", "archived"]
    else:
        system_tags = [cls.__tag]
    ...
    system_tag...
And the step is actually "queued", or is it only "queued" in the pipeline state (i.e. the visualization did not update)?