WackyRabbit7 I do 'pkill -f trains' but it's the same... If you need to debug and test, run with --foreground and just hit Ctrl-C to end the process (it will never switch to background...). Helps?
You need trains-server support, so if trains v0.15 is running against an older backend it will revert to the "training" type.
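For reference, a minimal sketch of passing a non-default task type at init (assuming a newer SDK where Task.TaskTypes includes the extra types; names here are illustrative):

from clearml import Task

# with an older trains-server backend, anything beyond training/testing falls back to "training"
task = Task.init(
    project_name="examples",
    task_name="preprocess",
    task_type=Task.TaskTypes.data_processing,
)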
LovelyHamster1 Verified, this is a UI bug with an old limitation still enforced.
I will make sure they know about it; it should be fixed in the upcoming release 🙂
Hi LovelyHamster1
As you noted, passing overrides in Args/overrides, for example ['training.max_epochs=1000'],
should work when running with the agent.
Could you verify with the latest RC? There was a fix to support the latest Hydra version:
pip install clearml==0.17.5rc5
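For reference, a minimal sketch of setting that override on a cloned task before enqueuing it to the agent (project/task/queue names here are illustrative):

from clearml import Task

base = Task.get_task(project_name="examples", task_name="hydra train")
cloned = Task.clone(source_task=base)
# the Hydra overrides live under the Args section, as noted above
cloned.set_parameter("Args/overrides", "['training.max_epochs=1000']")
Task.enqueue(cloned, queue_name="default")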
I think the only way is using the API, with Task.query_tasks and a filter. Would that help?
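Something along these lines (a sketch; the exact filter fields are an assumption on my side):

from clearml import Task

# fetch IDs of tasks in a project, filtered by name (regex) and status
task_ids = Task.query_tasks(
    project_name="examples",
    task_name="train.*",
    task_filter={"status": ["completed"]},
)
for task_id in task_ids:
    print(task_id)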
Sure 🙂
BTW: clearml-agent will mount your host's .ssh into the docker container at /root/.ssh by default,
so there's no need to do that manually.
Like, if you google "dagster and clearml" or "prefect and clearml" or "airflow and clearml" -- I don't find any blogs written by people talking about how they use both of them together.
Oh yeah, I see your point. I think the main reason is that a lot of the DAG capabilities and the orchestration are already folded into clearml's own capabilities (i.e. pipelines + clearml-agent etc.)
That said, I'm pretty sure I have seen people just adding Task.init into each of the above frameworks' steps, in order to t...
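For example, a rough sketch with Prefect (assuming Prefect 2.x; names are illustrative), where each step just calls Task.init:

from clearml import Task
from prefect import flow, task

@task
def train_step():
    # each framework step gets its own ClearML task for tracking/logging
    clearml_task = Task.init(project_name="demo", task_name="train_step")
    # ... actual training code; supported frameworks are auto-logged ...
    clearml_task.close()

@flow
def pipeline():
    train_step()

if __name__ == "__main__":
    pipeline()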
(This is why we recommend using pip, because it is stable and clearml-agent takes care of pytorch/cuda versions)
@<1610083503607648256:profile|DiminutiveToad80> try turning on the following in your clearml.conf (agent section):
enable_git_ask_pass: true
I'll try to find the link...
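A minimal sketch of where that goes, assuming the default ~/clearml.conf layout:

agent {
    # let the agent pass git credentials via GIT_ASKPASS
    # instead of embedding them in the clone URL
    enable_git_ask_pass: true
}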
@<1587253076522176512:profile|HollowPeacock33>
Is this a commercial ad? This seems out of scope for this channel.
Can you expand?
But there is no need for 2FA for cloning a repo
My clearml-server crashed for some reason
😞 No worries
Seems like something is not working with the server, i.e. it cannot connect to one of the dockers.
May I suggest carefully going through all the steps here and making sure nothing was missed:
https://github.com/allegroai/trains-server/blob/master/docs/install_linux_mac.md
Especially number (4)
"erasing" all the packages that had been set in the base task I'm cloning from. I
Set is not add: if you are calling set_packages, you are overwriting all of them with this single call.
You can however do:
task_data = task.export_task()
# "pip" holds the requirements as a single newline-separated string
requirements = task_data["script"]["requirements"]["pip"]
requirements += "\nnew_package==1.0"  # append on a new line, not onto the last entry
task.set_packages(requirements)
I guess we should have get_requirements ?!
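In the meantime, a hypothetical helper along those lines (get_requirements is not an actual SDK call, just a wrapper over the export above):

from clearml import Task

def get_requirements(task: Task) -> str:
    # hypothetical helper: pull the pip requirements string
    # out of the exported task data, as in the snippet above
    return task.export_task()["script"]["requirements"].get("pip", "")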
Okay, this is very close to what the agent is building:
Could you start a new conda env,
then install cudatoolkit=11.1,
then run:
conda env update -p <conda_env_path_here> --file the_env_yaml.yml
What would be the best way to get all the models trained using a certain Task? I know we can use query_models to filter models based on Project and Task, but is it the best way?
On the Task object itself you have all the models:
Task.get_task(task_id='aabb').models['output']
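For example, a quick sketch iterating over a task's output models (assuming 'aabb' is your task ID):

from clearml import Task

task = Task.get_task(task_id="aabb")
for model in task.models["output"]:
    print(model.name, model.id, model.url)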
Glad to hear that, indeed an odd issue... Is this reproducible, i.e. can we get something to fix it?
Exactly! nice 🎉
Hi BoredGoat1
From this warning: "TRAINS Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring" it seems trains failed to load the NVIDIA shared library (.so) that does the GPU monitoring:
This is based on pynvml, and I think it is trying to access "libnvidia-ml.so.1"
Basically saying, if you can run nvidia-smi from inside the container, it should work.
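If you want a quick check from Python, a minimal sketch using pynvml directly (the same library trains relies on):

import pynvml

# raises pynvml.NVMLError if libnvidia-ml.so.1 cannot be loaded/initialized
pynvml.nvmlInit()
print("GPUs visible:", pynvml.nvmlDeviceGetCount())
pynvml.nvmlShutdown()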
BoredGoat1
Hmm, that means it should have worked with Trains as well.
Could you run the attached script, see if it works?
MagnificentSeaurchin79
"requirements.txt" is ignored if the Task has an "installed packges" section (i.e. not completely empty) Task.add_requirements('pandas') needs to be called before Task.init() (I'll make sure there is a warning if called after)
Yes, that is an issue for me: even if we could centralize an environment today, there is still a concern that whenever we add a model, package changes could cause issues with older models.
Yeah, changing the environment on the fly is tricky; it basically means spinning up an internal HTTP service per model...
Notice you can have many clearml-serving sessions, they are not limited, so this means you can always spin up a new serving instance with a new environment. The limitation is changing an e...
Yeah... that should have worked...
What's the exact error you are getting?
Since I'm assuming there is no actual task to run, and you do not need to set up the environment (is that correct?),
you can do:
$ CLEARML_OFFLINE_MODE=1 python3 my_main.py
wdyt?
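Alternatively, if you'd rather set it from code than via the environment variable, a sketch using Task.set_offline:

from clearml import Task

Task.set_offline(offline_mode=True)  # must be called before Task.init()
task = Task.init(project_name="examples", task_name="offline run")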
I assume it is reported into TB, right?