Hi @<1695969549783928832:profile|ObedientTurkey46>
Use --services-mode in the agent; it will run many Tasks on the same machine. This is usually associated with the services queue, but it can run on any queue. This way you could easily have the same machine running those multiple "control" tasks.
wdyt?
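For reference, a minimal sketch of how a control task could end up on that services-mode agent; the project/task names and the "services" queue name here are just assumptions:

from clearml import Task

# assuming an agent was started with --services-mode and is listening on the "services" queue
task = Task.init(project_name="controllers", task_name="my-control-task")

# stop local execution and enqueue this task, so the services-mode agent
# picks it up alongside the other "control" tasks on the same machine
task.execute_remotely(queue_name="services", exit_process=True)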
Hmm, so this is kind of a hack for ClearML AWS autoscaling?
and every instance is running an agent? or a single Task?
Hi MortifiedCrow63
I have to admit this is very strange, I think the fact it works for the artifacts and not for the model is kind of a fluke ...
If you use "wait_on_upload" argument in the upload_artifact you end up with the same behavior. Even if uploaded in the background, the issue is still there, for me it was revealed the minute I limited the upload bandwidth to under 300kbps.It seems the internal GS timeout assumes every chunk should be uploaded in under 60 seconds.
The default chunk...
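A minimal sketch of the "wait_on_upload" behavior I'm referring to (project/task/artifact names are made up):

from clearml import Task

task = Task.init(project_name="examples", task_name="artifact-upload-test")

# wait_on_upload=True blocks until the upload completes instead of uploading
# in the background; with throttled bandwidth the GS timeout shows up either way
task.upload_artifact(name="training-data", artifact_object={"rows": 1000}, wait_on_upload=True)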
You can always click on the name of the series and remove it from the display.
Why would you need three graphs?
Fix pushed to github
pip install git+
And still a difference between A/B, one detecting the repo and the other not?
Let me check... I think you might need to docker exec
Anyhow, I would start by upgrading the server itself.
Sounds good?
Well, from 2 to 30 sec is a factor of 15, I think this is a good start
Hi SubstantialElk6
What if I have OS library dependencies as well? (Apt install, rpm install...etc).
If these are OS libraries that you always need you can put them here:
https://github.com/allegroai/clearml-agent/blob/d9b9b4984bb8a83914d0ec6d53c86c68bb847ef8/docs/clearml.conf#L136
agent.extra_docker_shell_script: ["apt-get install -y bindfs", ]
In the next version, this could be controlled on a per Task basis.
FYI: the default apt packages that are installed:
apt-get update
a...
Example:
Task.get_task(..., task_filter={'tags': ['best'], 'order_by': ["-last_update"]})
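For example, a hedged sketch of fetching the most recently updated task tagged "best" (the project name is hypothetical):

from clearml import Task

task = Task.get_task(
    project_name="examples",  # hypothetical project
    task_filter={
        "tags": ["best"],              # only tasks tagged "best"
        "order_by": ["-last_update"],  # newest first
    },
)
print(task.id if task else "no matching task")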
Hi MortifiedCrow63
Sorry, getting GS credentials is taking longer than expected
Nonetheless it should not be an issue (model upload is essentially using the same StorageManager internally)
Hmm check if this one works:
optimizer._get_child_tasks_ids(
    parent_task_id=optimizer._job_parent_id or optimizer._base_task_id,
    order_by=optimizer._objective_metric._get_last_metrics_encode_field(),
    additional_filters={'page_size': int(top_k), 'page': 0})
If it does, let's PR it as a dedicated function
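Something along these lines is what I have in mind, just wrapping the private calls above (the helper name is hypothetical):

def get_top_experiment_ids(optimizer, top_k):
    # return the ids of the top_k child tasks, ordered by the objective metric
    return optimizer._get_child_tasks_ids(
        parent_task_id=optimizer._job_parent_id or optimizer._base_task_id,
        order_by=optimizer._objective_metric._get_last_metrics_encode_field(),
        additional_filters={"page_size": int(top_k), "page": 0},
    )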
Hi GiddyPeacock64
If you already have K8s set up and are already using ClearML, then in your kubeflow YAML:
trains-agent execute --id <task_id> --full-monitoring
This will install everything your Task needs inside the docker. Just make sure that you pass the env variables configuring ClearML, see here:
https://github.com/allegroai/clearml-server/blob/6434f1028e6e7fd2479b22fe553f7bca3f8a716f/docker/docker-compose.yml#L127
ClumsyElephant70
Could it be the virtualenv package is not installed on the host machine?
(From the log it seems you are running in venv mode, is that correct?)
Hi @<1743079861380976640:profile|HighKitten20>
but when I try to use code stored in a GIT (Bitbucket) repo I got a repository cloning error, specifically
did you configure the git repo application/pass here: None
@<1523710674990010368:profile|GreasyPenguin14> make sure it uses https, not ssh:
edit ~/clearml.conf
force_git_ssh_protocol: false
and that you have both git_user & git_pass set in your clearml.conf
logger.report_scalar("loss-train", "train", iteration=0, value=100)
logger.report_scalar("loss=test", "test", iteration=0, value=200)
notice that the title of the graph is its unique id, so if you send scalars with the same "title" they will show on the same graph
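To make the title/series distinction concrete, a small sketch (project/task names are made up); both calls share the title "loss", so the two series are drawn on the same graph:

from clearml import Task

task = Task.init(project_name="examples", task_name="scalar-demo")
logger = task.get_logger()

for i in range(10):
    # same title ("loss") -> both series end up on the same graph
    logger.report_scalar(title="loss", series="train", iteration=i, value=100 - i)
    logger.report_scalar(title="loss", series="test", iteration=i, value=110 - i)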
JitteryCoyote63 Great to hear
BTW:
Would it be possible to extend Task.init with a force_reuse that would enforce reusing these tasks?
You can pass continue_last_task=True
I think it should be equivalent to what you suggest
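i.e. something along these lines (project/task names are assumptions):

from clearml import Task

# reuse and continue the previous task instead of creating a new one
task = Task.init(
    project_name="examples",
    task_name="my-training",
    continue_last_task=True,
)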
Whoa, are you saying there's an autoscaler that doesn't use EC2 instances?...
Just to be clear, the ClearML Autoscaler (AWS) will spin instances up/down based on jobs in the queue it is listening to (the type of EC2 instances and their configuration are fully configurable)
EnviousStarfish54 thanks again for the reproducible code, it seems this is a Web UI bug, I'll keep you updated.
If you do not have a lot of workers, then I would guess it's the console outputs
Hi @<1720249421582569472:profile|NonchalantSeaanemone34>
Is it possible to read data directly from server w/o using get_local_copy()?
do you mean an artifact? what does "direct" mean here?
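If it is an artifact, a hedged sketch of the two access patterns (task id and artifact name are placeholders): .get() deserializes the object in memory, while .get_local_copy() downloads it to a local file first.

from clearml import Task

task = Task.get_task(task_id="<task_id>")   # placeholder id

artifact = task.artifacts["training-data"]  # placeholder artifact name
obj = artifact.get()                        # load the object directly
path = artifact.get_local_copy()            # or download a local copy and get its path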
PlainSquid19 I will look into it as well.
Maybe for some reason model.keras_model.save_weights is not caught ...
Hi @<1729309120315527168:profile|ShallowLion60>
How did you create those credentials ?
Hey JitteryCoyote63 I think I need to better explain the config feature:
agent.package_manager.post_packages = ["PyJWT"]
Basically this means that IF you have pyjwt in the installed packages list, it will be installed after everything else is installed.
This doesn't mean it will always be installed.
Think for example "horovod" has to be installed after you have TF / PyTorch installed.
(The same goes for "pre_package" and Cython)
ShaggyHare67
Now the trains-agent is running my code but it is unable to import trains ...
What you are saying is that you spin the 'trains-agent' inside a docker, but in venv mode?
On the server I have both python (2.7) and python3,
Hmm make sure that you run the agent with python3 trains-agent
this way it will use python3 for the experiments
@<1639799308809146368:profile|TritePigeon86> +1
With pleasure