
I'm looking into the savefig issue; meanwhile you can disable the popup by adding the following at the top of your code:

    import matplotlib
    matplotlib.rcParams['backend'] = 'agg'
    import matplotlib.pyplot
    matplotlib.pyplot.switch_backend('agg')
And command is a list instead of a single str
"command list", you mean the command
argument ?
you are correct, I was referring to the template experiment
Task.debug_simulate_remote_task
simulates the Task being executed by the agent (basically same behaviour, only local). The argument it gets is the Task ID (string).
The way to see how it works is to run the code once (without the debug_simulate call), get the ID of the Task we created, then rerun with debug_simulate_remote_task passing the previous Task ID.
Make sense ?
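A minimal sketch of that two-step flow (project/task names are placeholders):

    from clearml import Task

    # Run 1: leave the next line commented out, run the script,
    # and note the printed Task ID.
    # Run 2: uncomment it with that ID to simulate the agent
    # executing the same Task locally.
    # Task.debug_simulate_remote_task(task_id="<task-id-from-run-1>")

    task = Task.init(project_name="examples", task_name="remote simulation demo")
    print(task.id)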
JitteryCoyote63 are you suggesting it happens ?
(obviously it should not 🙂 )
I'm guessing the extra index URL can be a URL to the github repo of interest?
The extra index URL is exactly what you would pass to pip install, meaning it has to comply with the PyPI repository API.
Make sense ?
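For reference, this is roughly how an extra index URL is wired into the agent's clearml.conf (the URL is a placeholder for any PyPI-compliant index, e.g. an Artifactory pypi repository):

    agent {
        package_manager {
            # passed to pip as --extra-index-url when the agent installs packages
            extra_index_url: ["https://artifactory.example.com/api/pypi/my-pypi/simple"]
        }
    }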
JitteryCoyote63
I am setting up a new machine with two RTX 3070 GPUs
Nice! You are one of the lucky few who managed to buy them 🙂
Which makes me think that the wrong torch package is installed
I think that torch 1.3.1 does not support CUDA 11 🙂
TenseOstrich47
I noticed that with one agent, only one task gets executed at one time
Yes you can 🙂
Also, you are correct: a single agent will run a single Task at a time. That said, you can have multiple agents running on the same machine, and when you launch them you specify which GPUs they use (in theory they can share the same GPU, but your code might not like it 🙂 )
You can see a few examples here:
https://github.com/allegroai/clearml-agent#running-the-clearml-agent
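For example (queue name and GPU indices are illustrative), two agents on one machine, each pinned to its own GPU:

    clearml-agent daemon --queue default --gpus 0
    clearml-agent daemon --queue default --gpus 1

Each daemon pulls Tasks from the queue independently, so two Tasks can run concurrently.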
Have to get glue setup, which I couldn't understand fully, so that's a different topic
I suggest using the apply-template setup (basically you provide a Job/Service template, and it uses that to create k8s jobs based on the Tasks coming in from the specific queue)
the use case I have is to allow people from my team to run their workloads on a set of servers without stepping on each other..
So does that mean CPU only workloads?
Also, should we worry about fairness? (i.e. someone "taking" all the CPU for themselves)
Ohhh, okay, as long as you know; they might fail on memory...
Is there a way to do this using ssh keys?
the .ssh of the host machine should be automatically mounted; you can force it by setting force_git_ssh_protocol: true
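In clearml.conf that setting lives under the agent section (a sketch):

    agent {
        # convert git http/https links to ssh:// so the mounted
        # ~/.ssh keys of the host are used for cloning
        force_git_ssh_protocol: true
    }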
It is still not working for me. Are you using Linux, Windows or macOS?
Should work on Linux, Mac and Windows; what are you using?
Yes, but I'm not sure that they need to have separate tasks
Hmm okay I need to check if this can be easily done
(BTW, the downside of that is you can only cache a component, not a sub-component)
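For context, component-level caching looks roughly like this (a sketch; the function and its argument are made up):

    from clearml.automation.controller import PipelineDecorator

    # cache=True reuses the stored result when the component's code and
    # inputs are unchanged. Note the caching granularity is the whole
    # component; nothing it calls internally is cached separately.
    @PipelineDecorator.component(return_values=["out_path"], cache=True)
    def preprocess(data_path: str):
        # heavy, deterministic work worth caching
        print(f"preprocessing {data_path}")
        return data_path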
Hi TightDog77
HTTPSConnectionPool(host='', port=443): Max retries exceeded with url: /upload/storage/v1/b/models/o?uploadType=resumable (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2633)')))
This seems like a network error to GCP (basically the GCP python package throws it)
Are you always getting this error? Is this something new?
Hi GentleSwallow91
I am very much concerned with docker container spin up time.
To accelerate spin-up time (mostly pip install), use venv caching (basically it will store a cache of the entire installed venv so it does not need to reinstall it)
Uncomment this line:
https://github.com/allegroai/clearml-agent/blob/178af0dee84e22becb9eec8f81f343b9f2022630/docs/clearml.conf#L116
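For reference, the relevant block in clearml.conf looks roughly like this (paraphrased from the sample config; the commented path line is the one to enable):

    agent {
        venvs_cache: {
            # maximum number of cached venvs
            max_entries: 10
            # minimum free space (GB) required to keep a cache entry
            free_space_threshold_gb: 2.0
            # uncomment to enable venv caching
            # path: ~/.clearml/venvs-cache
        }
    }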
The problem above could be that I used a non-root user to train a model and all packages are installed for ...
ExcitedFish86 this is a general "dummy agent" that pulls Tasks and executes them (no env created, no code cloned, as you suggested)
how does this work with HPO?
The HPO clones Tasks, changes arguments, pushes them into a queue, and monitors the metrics in real time. The missing part (from my understanding) was that the execution of the Tasks themselves required setup, and that you wanted multiple-machine support; to overcome that, I posted a dummy agent that just runs the Tasks.
(Notice...
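For reference, the clone-mutate-enqueue pattern the HPO implements looks roughly like this (a sketch; the base task ID, parameter names and queue are placeholders):

    from clearml.automation import (
        HyperParameterOptimizer,
        UniformIntegerParameterRange,
    )
    from clearml.automation.optuna import OptimizerOptuna

    optimizer = HyperParameterOptimizer(
        base_task_id="<template-task-id>",  # the experiment to clone
        hyper_parameters=[
            UniformIntegerParameterRange(
                "General/batch_size", min_value=16, max_value=128, step_size=16
            ),
        ],
        objective_metric_title="validation",
        objective_metric_series="loss",
        objective_metric_sign="min",
        optimizer_class=OptimizerOptuna,
        execution_queue="default",  # agents pull the cloned Tasks from here
        max_number_of_concurrent_tasks=2,
    )
    optimizer.start()
    optimizer.wait()
    optimizer.stop()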
Hi GrievingTurkey78
I'm assuming similar to https://github.com/pallets/click/ ?
Auto connect and store/override all the parameters?
If I have access to the logs, python env and git commits, is there an API to log those to the experiments too?
Sure: task.update_task
see here:
https://clear.ml/docs/latest/docs/references/sdk/task#update_task
example:

    task.update_task(task_data={'script': {'branch': 'new_branch', 'repository': 'new_repo'}})

The easiest way to get all the different sections (they should be relatively self-explanatory) is calling task.export_task(), which returns a dict with all the fields yo...
It might be that the worker was killed before it unregistered; you will see it there but the last update will be stuck (after 10 minutes it will be automatically removed)
Yey! MysteriousBee56 kudos for keeping at it!
I'll make sure we report those errors, because this debug process should have been much shorter 🙂
BTW, we figured out that the ' belongs to the echo
yep, when seeing the full command it is apparent
MysteriousBee56 Okay, let's try this one:

    docker run -t --rm nvidia/cuda:10.1-base-ubuntu18.04 bash -c "echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' > /etc/apt/apt.conf.d/docker-clean && apt-get update && apt-get install -y git python3-pip && python3 -m pip install trains-agent && echo done"
Okay, now let's try (EDIT):

    docker run -t --rm nvidia/cuda:10.1-base-ubuntu18.04 bash -c "echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' > /etc/apt/apt.conf.d/docker-clean && apt-get update && apt-get install -y git python3-pip && python3 -m pip install trains-agent && python3 -m trains_agent --help"
MysteriousBee56 not a different port, just not with "localhost" but with your machine's IP
No, after. Do you see the poetry.lock removed in the uncommitted changes?
it seems like each task is set up to run on a single pod/node based on attributes like gpu memory, os, num of cores, worker
BoredHedgehog47 of course you can scale on multiple nodes.
The way to do that is to create a k8s YAML with replicas; each pod is actually running the exact same code with the exact same setup. Notice that inside the code itself the DL frameworks need to be able to communicate with one another and b...
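A minimal sketch of that idea (all names and the image are placeholders; the distributed-training rendezvous between pods is left to the DL framework, e.g. torch.distributed):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: training-workers
    spec:
      replicas: 4                 # identical pods, same code, same setup
      selector:
        matchLabels:
          app: training
      template:
        metadata:
          labels:
            app: training
        spec:
          containers:
          - name: worker
            image: registry.example.com/training:latest
            command: ["python3", "train.py"]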
Ohh I see, could you copy-paste what you put there (instead of the secret and key, *** will do 🙂 )
Hmm, so currently you can provide a help string, so users know what they can choose from, but there is no way to limit it.
I know the Enterprise version has something similar that allows users to create a custom "application" from a Task; there you can define a drop-down and such, but that might be overkill here, wdyt?