😞 anything that can be done?
Okay, so the way it works is that it moves all the logging to a background process. But if you have a lot of data, actually pushing the data between Python processes is not very efficient. This line basically tells it to use a background thread (instead of a background process) for sending the data to the server.
The idea behind using a background process in the first place is to better support PyTorch workers that spin up a lot of subprocesses, and we do not want to add a thread per process and in...
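For reference, a minimal sketch of what that toggle looks like in clearml.conf, assuming the relevant setting is report_use_subprocess (my assumption; double-check against your SDK version):
sdk {
  development {
    # false = send reports via a background thread instead of a background process
    report_use_subprocess: false
  }
}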
do I need to create a brand new dataset with a new name that inherits from the original?
Yes, you just create a new version, specify the parent one, add the changes and close it.
If you later need to, you can squash a version (same idea as git squash). Make sense?
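Something along these lines (a minimal sketch; the project/dataset names are placeholders):
from clearml import Dataset

# create a new version that inherits from the parent dataset
parent = Dataset.get(dataset_project="my_project", dataset_name="my_dataset")
child = Dataset.create(
    dataset_project="my_project",
    dataset_name="my_dataset",
    parent_datasets=[parent.id],
)
child.add_files(path="./changed_files")  # add the changes
child.upload()
child.finalize()  # "close" the version

# later: squash versions into a single standalone dataset (same idea as git squash)
Dataset.squash(dataset_name="my_dataset_squashed", dataset_ids=[parent.id, child.id])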
Hi FiercePenguin76
It seems it fails to detect the notebook server and thinks it is running as a script.
What is exactly your setup?
docker image ?
jupyter-lab version ?
clearml version?
Also, are you getting any warnings when calling Task.init?
Hi VirtuousFish83
Apologies for the confusing documentation 🙂 It sounds complicated, but it should actually be relatively simple. Based on what I understand, you already have the server set up and your code integrated. The question is "can you see an experiment in the UI?" If you do, then you can right-click it, clone the experiment, edit parameters and send it for execution (enqueue). If the experiment is not in the UI you can either (1) run the code with the Task.init call, it will automatica...
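And the programmatic equivalent of the clone/edit/enqueue flow, if that helps (a minimal sketch; all names are placeholders):
from clearml import Task

# clone an existing experiment, tweak a parameter, and send it for execution
template = Task.get_task(project_name="my_project", task_name="my_experiment")
cloned = Task.clone(source_task=template, name="my_experiment clone")
cloned.set_parameter("General/learning_rate", 0.001)
Task.enqueue(cloned, queue_name="default")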
Hi JealousParrot68
do tasks that are created through create_function_task run the entry_script again instead of just the pure function
Basically they will run the code until the create_function_task call, but never past it. We are working on adding a decorator for a function, making it a "standalone" script, is this what you actually need?
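To illustrate the current behavior (a minimal sketch; names are placeholders):
from clearml import Task

def process(multiplier=2):
    # the function task executes only this function
    return 21 * multiplier

task = Task.init(project_name="examples", task_name="main")
# when the created task runs remotely, it re-executes the entry script
# up to this call and then runs only process() with the given arguments
fn_task = task.create_function_task(
    func=process, func_name="process", task_name="process step", multiplier=2
)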
Hi @<1784754456546512896:profile|ConfusedSealion46>
clear ml server took so much memory usage, especially for elastic search
Yeah, that depends on how many metrics/logs you have there, but you really need at least 8GB of RAM.
Maybe delete old experiments?
Hi JitteryCoyote63
Is it possible to rollback from 1.2.0 to 1.1.0?
Not really; there was a DB migration, so an out-of-the-box downgrade is not supported.
That said, v1.3.1 is already out, with what seems like a fix:
As a quick fix, can you test with auto-refresh (see the top-right button with the pause sign you have in the video)?
Can you try to manually install it and see what you are getting?
python3.10 -m pip install /home/boris/.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Do we launch multiple groups of these in different projects?
Actually, Triton can serve multiple models, and the endpoints/models are controlled from clearml-serving.
The only issue is adding a load balancer in front of multiple nodes to balance the requests between them. wdyt?
Hi RoughTiger69
How about using the pipeline decorator as a way to run this logic?
https://github.com/allegroai/clearml/blob/master/examples/pipeline/pipeline_from_decorator.py
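A bare-bones sketch of that example (names are placeholders):
from clearml import PipelineDecorator

@PipelineDecorator.component(return_values=["data"])
def load_data():
    return [1, 2, 3]

@PipelineDecorator.pipeline(name="my pipeline", project="examples", version="1.0")
def main_logic():
    data = load_data()
    print(data)

if __name__ == "__main__":
    PipelineDecorator.run_locally()  # debug locally; remove for remote execution
    main_logic()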
I think I'm missing the context of where the code is executed....
btw: you can now set the configuration_objects directly when calling add_step 🙂
https://clearml.slack.com/archives/CTK20V944/p1633355990256600?thread_ts=1633344527.224300&cid=CTK20V944
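Roughly like this (assuming the argument is configuration_overrides; double-check the exact name against your clearml version):
from clearml import PipelineController

pipe = PipelineController(name="my pipeline", project="examples", version="1.0")
pipe.add_step(
    name="train",
    base_task_project="examples",
    base_task_name="train task",
    # override a named configuration object on the cloned step
    configuration_overrides={"My Config": "key: value"},
)
pipe.start()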
WickedGoat98
The trains-agent-services docker is always CPU; the idea is to put long-lasting services there (like the auto cleanup, Slack integration, HPO, etc.).
To spin up an agent with a GPU on any machine (regardless of where the trains-server is), check the trains-agent readme:
https://github.com/allegroai/trains-agent#running-the-trains-agent
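In short (the queue name here is just an example):
# one-time setup: install and configure server credentials
pip install trains-agent
trains-agent init
# spin an agent bound to GPU 0, pulling jobs from the "default" queue
trains-agent daemon --gpus 0 --queue default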
Let me check what the subsampling threshold is.
making me realize that this may have been optional
I think it is optional, and this is why it was not entered in the first place.
Could you double-check and just remove it from your manual pbtxt?
Hi RipeGoose2
There is no need for any TrainsLogger in PyTorch Lightning, as they switched to TensorBoard logging by default, and we automagically catch everything passed there.
What do you think is missing? or can be improved ?
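In other words, this should be enough (a minimal sketch):
from clearml import Task
import pytorch_lightning as pl

task = Task.init(project_name="examples", task_name="lightning run")
# Trainer uses the TensorBoard logger by default;
# its scalars/plots are automatically captured by clearml
trainer = pl.Trainer(max_epochs=5)
# trainer.fit(model, datamodule)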
ahh, because task_id is the "real" id of a task
Yes, the ID is a global, system-wide unique ID (regardless of the project etc.).
Maybe we will call tasks as
slug_yyyymmdd
Notice that you can just copy-paste the link from the address bar; it will bring you to the exact same view, meaning it is easily shared among users 🙂 You can, but I would actually use the Task ID. This also means that programmatically you can do task = Task.get_task(task_id_here)
and interact and query a...
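For example (minimal sketch; the ID is a placeholder copied from the address bar):
from clearml import Task

task = Task.get_task(task_id="<task-id-from-the-ui>")
print(task.name, task.get_parameters())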
Hmm, good question. I'm actually not sure if you can pass 24GB (this is not a limit on the GPU memory; it affects the memblock size, I think).
maybe this can cause the issue?
Not likely.
In the original pipeline (the one executed from PyCharm), do you see the "Pipeline" section under Configuration -> "Config objects" in the UI?
Do you have two agents pulling from the same queue ?
Maybe one of them is configured differently ?
Hi CleanPigeon16
I was wondering how (or if) you handle interruptions.
Good question. Basically (and I might be missing a few details, but I think that's the general gist):
A new instance will be spun up (spot/regular based on your "compute budget") as long as there is a job in the "monitored" queue. That means that if a worker was kicked by Amazon (i.e. a spot instance), another one will be spun up instead, as long as there is a job in the queue. That means that what is probably missing in you...
The main reason we need the above mentioned functionality is because there are some experiments that need to run for a long time. Let's say weeks.
Good point!
. We need to temporarily pause(kill or something else) running HPO task and reassign the resource for other needs.
Oh I see now....
Later, when more important experiments has been completed, we can continue HPO task from the same state.
Quick question: when you say the HPO Task, do you mean the HPO controller logic Task...
Ohh, the controller task itself holds the artifacts ?
hmmm I see...
It seems to miss the fact that your process does use the GPU.
Maybe the GPU is only used later in the run?
Does that make sense ?
Hi @<1715175986749771776:profile|FuzzySeaanemone21>
and then run "clearml-agent daemon --gpus 0 --queue gcp-l4" to start the worker.
I'm assuming the docker service cannot spin up a container with GPU access; usually this means you are missing the NVIDIA docker runtime component.
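A quick sanity check on the machine itself (assuming an NVIDIA host; the CUDA image tag is just an example):
# should print the GPU table; if it fails, install/configure nvidia-container-toolkit
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi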
ThickDove42 Windows conda python3.6 was exactly what I was using,
started the jupyter with:
"python -m jupyter notebook"
Then opened / created a new notebook, everything worked.
Tested on latest clearml 0.17.2
Maybe it's something with the path to the repo that breaks it? Because obviously the issue is that it is looking in the wrong folder.
Well, it should fail, but I think the error message should be fixed 🙂
maybe: ValueError: dataset 'tmp_datset' not found in project 'lavi-testing'
wdyt?