Reputation
Badges 1
25 × Eureka!SoggyBeetle95 the question is, where does clearml stores these arguments, and the answer is on the Task object (from there the agent will take them and apply to the docker execution). Now since all users see all the tasks, they also see these arguments. Wdyt?
and do you have import tensorflow in your code?
Hi TastyOwl44
So this depends on your code itself, but usually you need a CPU machine to run ClearML server (or use the free community server), than a machine to run the pipeline controller (usually the same machine running the clearml-server
, as the pipeline control code is basically controller only and does not execute the Task itself), lastly you need machines with GPU running the clearml-agent
(these GPU machines are the one actually doing the training inference etc.)
Make ...
You should have a download button when you hover over the table, I guess that would be the easiest.
If needed I can send an SDK code but unfortunately there is no single call for that
ValueError('Task object can only be updated if created or in_progress')
It seems the task
is not "running" hence the error, could that be
Hi EnthusiasticCoyote38
But one one process finished it changed task status to complete. May be you know some save way to deal with such situation? Or maybe the best way to check task status before upload object?
Well, you can actually forcefully set the state of the Task to running, then add artifacts, then close it?
would that work?
` my_other_task.reload()
my_other_task.mark_started(force=True)
my_other_task.upload_artifact(...)
my_other_task.flush(wait_for_uploads=True)
my_othe...
Sure thing!
BTW: not sure if it helps but the SaaS version integrates with Genesis Cloud I know they provide cheap GPUs might be worth checking
Will this still be considered asΒ
global site-packages
This is a pip settings, I "think" it inherits from the local user's installation, but I would actually install with "sudo pip" that will definitely be "inherited"
DefeatedCrab47 If I remember correctly v1+ has their arguments coming from argparse .
Are you using this feature ? 2. How do you set the TB HParam ? Currently Trains does not support TB HParams, the reason is the set of HParams needs to match a single experiment. Is that your case?
What is the specific use case, updating a file on existing dataset and creating a new version?
Hi ShortElephant92
No, this is opt-in, so other then checking for updates once in a while, no traffic at all
So could it be that pip install --no-deps .
is the missing issue ?
what happens if you add to the installed packages "/opt/keras-hannd" ?
Ok, so it doesn't follow the exact same rules asΒ
Task.init
?
Correct
I was afraid all the logs and outputs of a hyperparameter optimization task would be deleted just because no artifacts were created.Β (edited)
Should not happen π
UnevenDolphin73 are you positive, is this reproducible? What are you getting?
Any chance your code needs more than the main script, but it is Not in a git repo? Because the agent supports either single script file, or a git repo with multiple files
set the following:CLEARML_AGENT_DISABLE_SSH_MOUNT=1 clearml-agent daemon ...
The issue is, it will automatically mount the .ssh of the host into the container, so that if you are using SSH to clone git you have credentials, in your case, it also mounts the configuration, hence failing to login.
I will make sure we add it to the configuration file, so it is more visible
Ohh, the controller task itself holds the artifacts ?
So βwaitβ is a better metaphore for me
So I would do something like (I might have a few typos but that's the gist):
def post_execute_callback_example(a_pipeline, a_node):
# type (PipelineController, PipelineController.Node) -> None
print('Completed Task id={}'.format(a_node.executed))
# wait until model is tagged, then pass it as argument
while True:
found = Moodel.query_models(...) # model filter here, inlucing tag and project
if found:
...
Is ClearML combined with DataParallel
or DistributedDataParallel
officially supported / should that work without many adjustments?Yes it is suported, and should work
If so, would it be started via python ...
or via torchrun ...
?Yes it should, hence the request for a code snippet to reproduce the issue you are experiencing
What about remote runs, how will they support the parallel execution?Supported, You should see in the "script entry" something like "-m -m torch.di...
I am trying to see if the user can submit a list of resource requirements (e.g 4GPUs, 12 cores, 100GB diskspace)
This will be quite easy to implement using the cleamrl k8s glue, just use user-properties and change the template based on it. I can point to where you need to modify the code
AFAIK that's the only way right now (see my comment here - https://clearml.slack.com/archives/CTK20V944/p1657720159903739?thread_ts=1657699287.630779&cid=CTK20V944 )
Or then if you have the ClearML paid service, I believe there is a "vaults" service, right AgitatedDove14 ?
Yep UnevenDolphin73 :)
You should have metric :monitor:gpu
variant gpu_0_utilization
Since I see you have none of those, that points to no GPU driver ...
Could that be ?
You are correct, it is currently not supported in venv mode. We could not find a good use case for it. What is yours?
GiddyTurkey39
as others will also be running the same scripts from their own local development machine
Which would mean trains
` will update the installed packages, no?
his is why I was inquiring about theΒ
requirements.txt
Β file,
My apologies, of course this is supported π
If you have no "installed packages" (i.e. the field is empty in the UI) the trains-agent
will revert to installing the requirements.txt
from the git repo itself, then it...
So how can I temporarily fix it?
Try:task.output_uri = task.get_output_destination()
I useΒ
torch.save
Β to store some very large model, so it hangs forever when it uploads the model. Is there some flag to show a progress bar?
I'm assuming the upload is http upload (e.g. the default files server)?
If this is the case, the main issue we do not have callbacks on http uploads to update the progress (which I would love a PR for, but this is actually a "requests" issue)
I think we had a draft somewhere, but I'm not sure ...
yes i can communicate with the server, i managed to put tasks in the queue and retrieve them as well as running tasks with metrics reporting
Through the UI or python code ?
Hi CheerfulGorilla72 ,
Sure there are:
https://github.com/allegroai/clearml/tree/master/examples/frameworks/pytorch-lightning