Can you also make sure you did not check "Disable local machine git detection" in the ClearML PyCharm plugin?
I just assumed it should only be triggered by dataset-related things, but after a lot of experimenting I realized it's also triggered by Tasks...
VexedCat68 I think you are correct, and it should only be triggered by "Dataset" Tasks. That said, maybe there is a bug, in which case if there are no additional filters it will get triggered on any change in the project. That would explain why adding the tags filter solved the issue.
wdyt?
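For reference, a hedged sketch of that tags-filter workaround (parameter names are from memory and every project/queue/task value below is a placeholder, so double-check against the TriggerScheduler docs):
```python
from clearml.automation import TriggerScheduler

# poll the backend every few minutes, and only fire on tagged Dataset tasks
scheduler = TriggerScheduler(pooling_frequency_minutes=3)
scheduler.add_dataset_trigger(
    name="retrain-on-new-data",             # hypothetical trigger name
    schedule_task_id="<training_task_id>",  # placeholder: Task to clone and enqueue
    schedule_queue="default",
    trigger_project="datasets/my_project",  # placeholder project
    trigger_on_tags=["ready"],              # the tags filter that keeps plain Tasks from firing it
)
scheduler.start_remotely(queue="services")
```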
Based on what I see, when the EC2 instance starts it installs the latest version. Could it be this instance is still running?
No worries, basically they are independent. Spin up your JupyterHub, then every user will have to set their own credentials on the JupyterLab instance they use. Maybe there is a way to somehow connect a specific OS environment user -> JupyterLab in JupyterHub, that would mean users do not have to worry about credentials. wdyt?
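As an illustration, a minimal sketch of what each user could run in their own JupyterLab session instead of editing clearml.conf (all values below are placeholders):
```python
from clearml import Task

# per-user credentials set in code instead of ~/clearml.conf
Task.set_credentials(
    api_host="https://api.clear.ml",    # or your self-hosted server
    web_host="https://app.clear.ml",
    files_host="https://files.clear.ml",
    key="YOUR_ACCESS_KEY",
    secret="YOUR_SECRET_KEY",
)

task = Task.init(project_name="examples", task_name="jupyterlab-run")
```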
Hi ContemplativeCockroach39
Seems like you are running the exact same code as in the git repo:
Basically it points you to the exact repository https://github.com/allegroai/clearml and the script examples/reporting/pandas_reporting.py
Specifically:
https://github.com/allegroai/clearml/blob/34c41cfc8c3419e06cd4ac954e4b23034667c4d9/examples/reporting/pandas_reporting.py
Thanks GrievingTurkey78 , this is exactly what I was looking for!
Any chance you can open a GitHub issue (jsonargparse + lightning support)?
I really want to make sure this issue is addressed 🙂
BTW: this is only if jsonargparse is installed:
https://github.com/PyTorchLightning/pytorch-lightning/blob/368ac1c62276dbeb9d8ec0458f98309bdf47ef41/pytorch_lightning/utilities/cli.py#L33
(I think the GCP is already up, I'll double check)
Sorry @<1689446563463565312:profile|SmallTurkey79> just noticed your reply
Hmm so I know the enterprise version has a built-in support for slurm, which would remove the need to deploy agents on the slurm cluster.
What you can do is, on the SLURM login server (i.e. a machine that can run sbatch), write a simple script that pulls the Task ID from the queue and calls sbatch with clearml-agent execute --id <task_id_here>. Would this be a good solution?
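For illustration, a rough, untested sketch of such a polling script. It assumes the server's `queues.get_next_task` REST endpoint (the same call the agent uses, field names per the server API, so verify against your server version) and a hypothetical `run_task.sbatch` wrapper whose last line is `clearml-agent execute --id "$CLEARML_TASK_ID"`:
```python
import subprocess
import time

from clearml.backend_api.session.client import APIClient

client = APIClient()
# hypothetical queue name; resolve it to the queue id once at startup
queue_id = client.queues.get_all(name="slurm")[0].id

while True:
    # ask the server for the next pending Task in the queue
    result = client.queues.get_next_task(queue=queue_id)
    entry = getattr(result, "entry", None)
    if entry:
        # hand the Task over to SLURM; the wrapper script runs the agent in execute mode
        subprocess.check_call(
            ["sbatch", "--export=CLEARML_TASK_ID={}".format(entry.task), "run_task.sbatch"]
        )
    else:
        time.sleep(15)
```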
ElegantKangaroo44 it seems to work here?!
https://demoapp.trains.allegro.ai/projects/0e152d03acf94ae4bb1f3787e293a9f5/experiments/48907bb6e870479f8b230e6b564cd52e/output/metrics/plots
It is available of course, but I think you have to have clearml-server 1.9+
Which version are you running ?
Sure thing 🙂
BTW: ReassuredTiger98 this is definitely an interesting use case, and I think you can actually write some code to solve it if you like.
Basically let's follow up on your setup:
Machine X: agent listening to queues A, B_machine_a (*notice we have two agents here)
Machine Y: agent listening to queue B_machine_b
Now we (the users) will push our jobs into queues A and B
Now we have a service that does the following:
- see if we have a job in queue B
- check if machine Y is working...
Hi JitteryCoyote63
Somehow I thought it was solved 😞
1) Yes, please add a GitHub issue so we can keep track
2)
Task.current_task().get_logger().flush(wait=True)  # <-- WILL HANG HERE
Is this the main issue ?
I think you have it on the Workers and Queues page, when you click on the worker you can see its details
How about this one:
None
First I would check the CLI command it will basically prefill it for you:
https://clear.ml/docs/latest/docs/apps/clearml_task
Specifically to your question, working directory "." is the root of the git repo
But I would avoid adding it manually, use the CLI, it will either ask you to provide the info or take the git repo details from the local copy
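For example, a hedged sketch of such a CLI call (repo, script, and queue values are placeholders, see the docs page above for the full flag list):
```bash
clearml-task \
  --project examples \
  --name remote-pandas-report \
  --repo https://github.com/allegroai/clearml.git \
  --script examples/reporting/pandas_reporting.py \
  --queue default
```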
I am actually saving a dictionary that contains the model as a value (+ training datasets)
How are you specifically doing that? pickle?
So the thing is, clearml automatically detects the last iteration of the previous run; my assumption is you also add it, hence the double shift.
SourOx12 could that be it?
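To illustrate the point, a hedged sketch (project/task names are placeholders) that reports the local step only and lets the continued Task's own offset do the shifting:
```python
from clearml import Task

# continuing a previous run; ClearML picks up the last iteration as an offset
task = Task.init(project_name="example", task_name="resume-run",
                 continue_last_task=True)
logger = task.get_logger()

for step in range(100):
    loss = 1.0 / (step + 1)  # dummy value for illustration
    # report the *local* step; adding the previous run's last iteration yourself
    # would apply the shift twice
    logger.report_scalar("loss", "train", value=loss, iteration=step)
```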
VirtuousFish83 I can confirm clearml-server 1.3 solves the issue.
Weird?! I see this in the code:
https://github.com/allegroai/clearml/blob/382d361bfff04cb663d6d695edd7d834abb92787/clearml/automation/controller.py#L2871
FiercePenguin76
So running the Task.init from the jupyter-lab works, but running the Task.init from the VSCode notebook does not work?
Not really, the OS will almost never allow for that; it is actually based on fairness and priority. We can set the entire agent to have the same low priority for all of them, then the OS will always give the CPU when needed (most of the time it won't need to) and all the agents will split the CPUs among them, so no one gets starved 🙂 With GPUs it is a different story, there is no actual context switching or fairness mechanism like on the CPU
RoundMole15 how does the Task.init look like?
seems like I'm passing in my own docker image which is then used at run time?
You are passing the Default docker image, if the Task does not list a specific docker image it will use the one you passed.
Yes this is "docker mode" (in venv mode no dockers are used, it just creates a new venv per experiment and installs everything inside the venv)
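For example, a minimal sketch of pinning an image on the Task itself so it overrides the agent's default image (project/task/image names below are just placeholders):
```python
from clearml import Task

task = Task.init(project_name="example", task_name="docker-mode-run")
# stored on the Task and used by the agent when it runs in docker mode
task.set_base_docker("nvidia/cuda:11.8.0-runtime-ubuntu22.04")
```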
WittyOwl57 I can verify the issue reproduces! 🎉 !
And I know what happens: TQDM is sending an "up arrow" key. If you are running inside bash, that looks like CR (i.e. move the cursor to the beginning of the line), but when running inside other terminals (like PyCharm or the ClearML log) this "arrow key" is just a unicode character to print, it does nothing, and we end up with multiple lines.
Let me see if we can fix it 🙂
Hi DrabCockroach54
Notice the free GPU memory is global, hence (low), but the memory (at least with new nvidia drivers) is reported per process. I'm assuming the process using the memory is not a sub-process? Could that be? What's the OS you are running on?
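If it helps with the debugging, a hedged sketch using pynvml to list per-process GPU memory (assumes GPU index 0 and `pip install nvidia-ml-py3`):
```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assuming GPU 0
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    used_mib = (proc.usedGpuMemory or 0) / 1024 ** 2  # can be None on some drivers
    print("pid={} used={:.0f} MiB".format(proc.pid, used_mib))
pynvml.nvmlShutdown()
```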
Hi IrritableGiraffe81
Can you share a code snippet ?
Generally I would try:
task = Task.init(..., auto_connect_frameworks={'pytorch': False, 'tensorflow': False})
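And a hedged, fuller version of that call for reference (project/task names are placeholders):
```python
from clearml import Task

task = Task.init(
    project_name="example",
    task_name="no-framework-autolog",
    auto_connect_frameworks={"pytorch": False, "tensorflow": False},
)
```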