So this is very odd, it looks like a pip bug:
The agent is trying to install torch==2.1.0.*
because by default it ignores the 4th+ parts (they are unstable and torch has a tendency to remove them), and for some reason pip will not match 2.1.0.*
with, for example, "2.1.0.dev20230306+cu118"
but based on the docs it should work.
As a workaround you can always edit the requirement and change it to the final URL, for example: so ...
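If it helps to see the matching behaviour in isolation, here is a minimal sketch using the packaging library (which pip relies on internally); the version strings are the ones from above, and the idea that the default pre-release exclusion is what blocks the match here is only an assumption:
from packaging.specifiers import SpecifierSet
from packaging.version import Version

spec = SpecifierSet("==2.1.0.*")
candidate = Version("2.1.0.dev20230306+cu118")

# by default, dev/pre-release versions are excluded from specifier matching
print(candidate in spec)
# passing prereleases=True opts in to matching dev/pre-release versions
print(spec.contains(candidate, prereleases=True))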
ReassuredTiger98 no, but I might be missing something.
What do you mean by project-specific?
Seems like someone is sitting in the middle and rerouting the request (maybe both https and port)?!
Hi BroadMole98
A bit hacky but doable 🙂
task = Task.get_task(task_id='aabbcc')
task.get_logger().report_scalar(...)
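If it helps, a slightly fuller sketch of the same idea (the task ID, metric names and values below are placeholders, not from your setup):
from clearml import Task

# attach to an existing task by its ID (placeholder ID) and report into it
task = Task.get_task(task_id='aabbcc')
task.get_logger().report_scalar(
    title='loss',         # placeholder metric title
    series='validation',  # placeholder series name
    value=0.123,
    iteration=42,
)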
yey 🙂 notice that when executed by the agent, the call to execute_remotely
is skipped, and so is the if statement I added (since running_locally will return False when the process is executed by the agent)
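For reference, a minimal sketch of that pattern (the queue name and project/task names are just placeholders):
from clearml import Task

task = Task.init(project_name='examples', task_name='remote run')  # placeholder names

if Task.running_locally():
    # this block runs only on the local machine; when the agent executes
    # the task, running_locally() returns False and the block is skipped
    task.execute_remotely(queue_name='default')  # placeholder queue; enqueues the task and exits the local process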
Hi SubstantialElk6
where exactly in the log do you see the credentials ?
/tmp/.clearml_agent.234234e24s.cfg
What's the exact setup? (I mean, are you using the glue? If that's the case, I think the temp config file is only created inside the pod/docker, so upon completion it will be deleted alongside the pod.)
Can you clone the git with the .ssh credentials on the host machine ?
If so, can you do the same manually inside a docker (i.e. spin up a docker with the mount -v /home/hostuser/.ssh:/root/.ssh)?
Hi WackyRabbit7
I believe this is fixed in clearml-server 1.1 (this is a plotly color issue), releasing later today or tomorrow 🙂
Hi @<1559711593736966144:profile|SoggyCow20>
I would first like to say how amazing clearml is!
Thank you! 🙏
Running in Docker mode (v19.03 and above) - using default docker image: nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04
yes sdk.agent.default_docker.image = python:3.10.0-alpine
should be
agent.default_docker.image = python:3.10.0-alpine
Notice the scope is agent, not sdk
Try the following example.env:
CLEARML_SERVING_PORT=9090
CLEARML_WEB_HOST="http://<IP>:8080"
CLEARML_API_HOST="http://<IP>:8008"
CLEARML_FILES_HOST="http://<IP>:8081"
(I think the localhost is resolved to inside the container and not the host machine, hence the error)
DistressedGoat23
We are running hyperparameter tuning (using some CV) which might take a long time and might even be aborted unexpectedly due to machine resources.
We therefore want to see the progress
On the HPO Task itself (not the individual experiments, the one controlling it all) there is the global progress of the optimization metric. Is this what you are looking for? Am I missing something?
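If you want to read that progress programmatically (e.g. to poll it from a script) rather than in the UI, something along these lines should work; the controller task ID is a placeholder, and get_reported_scalars is just the generic scalar-fetching API, nothing HPO-specific:
from clearml import Task

# placeholder ID of the HPO controller task (the one running the optimization)
controller = Task.get_task(task_id='hpo_controller_id')

# all scalars reported on the controller, including the objective's global progress
scalars = controller.get_reported_scalars()
for title, series in scalars.items():
    print(title, list(series.keys()))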
Hi TeenyFly97
Can I super-impose the graphs while comparing experiments?
Hmm not at the moment, I think someone asked for the option to control it, in both comparison mode and "standalone" mode.
There is a long discussion on this feature here:
https://github.com/allegroai/trains/issues/81#issuecomment-645425450
Feel free to chime in 🙂
I think that the latest agreement is a switch in the UI, separating or collecting (super-imposing) those graphs.
ElegantCoyote26 what you are after is:
docker run -v ~/clearml.conf:/root/clearml.conf -p 9501:8085
Notice the internal port (i.e. inside the docker it is 8085, but the external one is changed to 9501)
Hi GleamingGrasshopper63
How well can the ML Ops component handle job queuing on a multi-GPU server
This is fully supported 🙂
You can think of queues as a way to simplify resources for users (you can do more than that, but let's start simple)
Basically you can create a queue per type of GPU, for example a list of queues could be: on_prem_1gpu, on_prem_2gpus, ..., ec2_t4, ec2_v100
Then when you spin the agents, per type of machine you attach the agent to the "correct" queue.
Int...
it worked!!!!
YEY!
I pass the IDs to the docker container as environment variables, so this does need a restart of the docker container, but I guess we can live with that for now
So this would help you decide which actual Model file to download? (Trying to understand how the argument is being used, meaning should we have it stored somewhere? There is meta-data on the Model itself, so we can use that to store the data.)
Hi FrothyShark37
Can you verify with the latest version?
pip install -U clearml
Hi ShortElephant92
No, this is opt-in, so other than checking for updates once in a while, no traffic at all
Could you verify the Task.init call is inside the main function and not in the global scope? We have noticed some issues with global-scope calls in some cases
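i.e. roughly this structure (project/task names are placeholders):
from clearml import Task

def main():
    # Task.init is called from inside main(), not at import/global scope
    task = Task.init(project_name='examples', task_name='my experiment')  # placeholder names
    # ... training / reporting code ...

if __name__ == '__main__':
    main()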
Hi PanickyMoth78 , an RC is out with a fix.
pip install clearml==1.6.3rc0
Thank you for noticing the graph issue.
Btw, do notice that since data is being changed inside the controller loop, the parents are still kind of odd, because the logic cannot tell the source of the data, so it assumes it depends on the current state (i.e. all the leaves)
Is this reproducible? I tried to run the same example code on my machine, and it started training ...
Do you have issues with other pytorch examples? Could you try simple reporting example:
https://github.com/allegroai/clearml/blob/master/examples/reporting/scalar_reporting.py
Hmm, what's the clearml version? What's the Python version, what's the OS? And the PyTorch version?
Hi @<1625303806923247616:profile|ItchyCow80>
Could you add some prints? Is it working without the Task.init call? The code looks okay, and the "No repository found"
message basically says it logs it as a standalone script (which makes sense)
Does it work if you remove the Task.init call?
yea, does the enterprise version have more functionality like this?
yes, all sorts of bits and pieces for easier DevOps / K8s etc.