CrookedWalrus33
Force SSH git authentication; the agent will auto-mount the host's ~/.ssh into the docker container:
https://github.com/allegroai/clearml-agent/blob/6c5087e425bcc9911c78751e2a6ae3e1c0640180/docs/clearml.conf#L25
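For reference, a minimal sketch of the relevant setting in clearml.conf (agent section):
agent {
    # force cloning over SSH; the agent then mounts the host's ~/.ssh into the container
    force_git_ssh_protocol: true
}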
Hi SmallDeer34
Generally, any torch.save(...) call is logged/uploaded by clearml automatically. Specifically in your case I think the only missing one is trainer_state.json, which I assume is a plain JSON file, and I imagine is part of the huggingface framework. You can easily upload it as an additional artifact with Task.upload_artifact. wdyt?
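For example, a minimal sketch (assuming the file sits in the working directory):
from clearml import Task

task = Task.current_task()  # the task created by your Task.init() call
# attach the HF trainer state file as an additional artifact
task.upload_artifact(name="trainer_state", artifact_object="trainer_state.json")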
Is it only for modified changes and not untracked files?
Basically everything that "git diff" outputs.
The agent will then re-apply it on the remote machine.
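Conceptually it is equivalent to something like this (a sketch, not the agent's actual implementation):
git diff HEAD > uncommitted.patch   # captured when the task is created
git apply uncommitted.patch         # replayed on the remote clone
Since untracked files never show up in the diff, they are not re-applied.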
The Overview panel would be extremely well suited for selecting a number of projects to compare.
Could you elaborate ?
Another useful feature would be to allow adding information (e.g. metrics or metadata) to the tooltip.
You mean... are we still talking about the "Overview" tab?
Hi @<1554275779167129600:profile|ProudCrocodile47>
Do you mean @ clearml.io ?
If so, then this is the same domain (.ml is sometimes flagged as spam, I'm assuming this is why they use it)
RipeGoose2 you are not limited to the automagic
From anywhere in your code you can always do:
from trains import Logger
Logger.current_logger().report_plotly(...)
So you can add any manual reporting on top of the one generated by lightning.
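For example, a minimal sketch of a manual plotly report (figure contents are made up):
import plotly.graph_objects as go
from trains import Logger

fig = go.Figure(data=go.Scatter(y=[1, 3, 2]))
Logger.current_logger().report_plotly(title="manual plot", series="example", iteration=0, figure=fig)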
Sounds good?
Thanks @doru! BTW, if you are running code from outside the trains repo, do you still get the double package?
Wait, so the pipeline step only runs if the pre execute callback returns True? It'll stop if it doesn't run?
Only if you have a callback function and that callback returns False will the step be skipped (otherwise it will be processed)
Another question, in the parents sequence in pipe.add_step, we have to pass in the name of the step right?
Correct, the step name is a unique identifier for the pipeline
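A minimal sketch (project/task names are hypothetical):
from clearml import PipelineController

pipe = PipelineController(name="my pipeline", project="examples", version="1.0")
pipe.add_step(name="step_a", base_task_project="examples", base_task_name="task a")
pipe.add_step(
    name="step_b",
    parents=["step_a"],  # parents are referenced by step name
    base_task_project="examples",
    base_task_name="task b",
    # returning False from the callback skips the step
    pre_execute_callback=lambda pipeline, node, params: True,
)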
how would I access the artifact of a previous step within the pre ...
Hi @<1523701523954012160:profile|ShallowCormorant89>
This is generally based on the number of agents, or am I missing something? Also, is it based on Tasks or decorated functions?
Exactly! It is very cool to see it in action, and it really works very well, kudos to these guys
BroadSeaturtle49 agent RC is out with a fix:
pip3 install clearml-agent==1.5.0rc0
Let me know if it solved the issue
CheerfulGorilla72
yes, IP-based access,
hmm, so this is the main downside of using an IP-based server: the links (debug images, models, artifacts) store the full URL (e.g. http://IP:8081/...). This means that if you switch IPs they will no longer work. Any chance to fix the new server to the old IP?
(the other option is somehow edit the DB with the links, I guess doable but quite risky)
This is a horrible setup; it means no authentication will pass, and it will literally break every JWT authentication scheme
Notice that if you are using TB, everything you report to the TB will appear as well 🙂
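A minimal sketch, assuming the standard torch TensorBoard writer:
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()
writer.add_scalar("val/loss", 0.42, 1)  # auto-captured by clearml once Task.init() was called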
Yes, offline got broken in 1.3.0 😞, the RC fixed it:
pip install clearml==1.3.1rc0
Stable release later this week
Thanks SarcasticSparrow10 !
I'll reply on the GitHub issue later (for better visibility)
But my initial thoughts:
(1) I think this was suggested, and hopefully we will get to implementing it, I can definitely see the value. Meanwhile you can achieve some of the functionality with the experiment table and custom columns 🙂
(2) "Don't display the performance metric" -> isn't that important? what am I missing?
(3) Hmm you mean just extra columns?
(4) sounds like a bug
(5) is this a plotly issue?...
Hi GreasyPenguin14
However the cleanup service is also running in a docker container. How is it possible that the cleanup service has access and can remove these model checkpoints?
The easiest solution is to launch the cleanup script with a mount point from the storage directory into the container ( -v <host_folder>:<container_folder> )
The other option, supported by clearml version 1.0 and above, is using Task.delete, which now supports deleting the artifacts and mod...
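A minimal sketch of that second option (kwarg name per clearml >= 1.0):
from clearml import Task

task = Task.get_task(task_id="<task id>")
task.delete(delete_artifacts_and_models=True)  # also removes the stored artifacts/models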
Yey!
Out of curiosity, what's the workflow with snowflake?
That might be me, let me check...
too large to be stored in the .cache path? It will be stored there anyway?
oh that is exactly why the latest release supports chunks, so you can get a partial copy 🙂
nonetheless, the assumption is that you will have to end up with the data locally, otherwise the network becomes a huge bottleneck
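For example, a sketch of fetching a single chunk (project/dataset names are hypothetical):
from clearml import Dataset

ds = Dataset.get(dataset_project="examples", dataset_name="my dataset")
folder = ds.get_local_copy(part=0, num_parts=4)  # download only the first of four chunks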
Make sense?
Thanks NonchalantDeer14 !
BTW: how do you submit the multi-GPU job? Is it multi-GPU or multi-node?
do you have a video showing the use case for clearml-session
I totally think we should, I'll pass it along 🙂
what is the difference between vscode via clearml-session and vscode via remote ssh extension ?
Nice! Remote vscode is usually thought of as SSH: basically you have your vscode running on your machine, and over SSH vscode automatically connects to the remote machine.
Clearml-Session also adds a new capability: VSCode inside your browser, where the VSCode itself as well...
Hi @<1523711619815706624:profile|StrangePelican34>
You can either report on the Model itself:
None
or you can force it on the Task:
task = Task.get_task("task id here")
task.mark_started(force=True)  # force-reopen the completed task for reporting
task.get_logger().report_scalar(...)
task.mark_completed(force=True)  # close it again when done
Hi @<1715175986749771776:profile|FuzzySeaanemone21>
and then run "clearml-agent daemon --gpus 0 --queue gcp-l4" to start the worker.
I'm assuming the docker service cannot spin up a container with GPU access; usually this means you are missing the nvidia docker runtime component
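A quick sanity check you could run on that machine (the image tag is just an example):
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
If it fails, installing the nvidia-container-toolkit and restarting docker usually fixes it.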
DeliciousBluewhale87
Upon ssh-ing into the folders in both the physical node (/opt/clearml/agent) and the pod (/root/.clearml), it seems there are some files there..
Hmm that means it is working...
Do you see any *.conf files there? What do they contain? (do they point to the correct clearml-server config?)
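For reference, the api section of clearml.conf should look something like this (the host/ports below are the defaults, adjust to your server):
api {
    web_server: http://<server-ip>:8080
    api_server: http://<server-ip>:8008
    files_server: http://<server-ip>:8081
}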
Wait, that makes no sense to me. The API from python and the API from the UI are getting the same data from the backend ...
What are you getting with:
from clearml import Task
task = Task.get_task(task_id="<put task id here>")
print(task.models)