IntriguedRat44 If the monitoring only shows a single GPU (the selected one), it means it is reading the correct CUDA_VISIBLE_DEVICES (this is how it knows you are only using the selected GPU rather than all of them).
There is nothing else in the code that will change the OS environment.
Could you print os.environ['CUDA_VISIBLE_DEVICES'] while running the code to verify?
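For example, a minimal check (just a sketch; where exactly you print it inside your training script is up to you):
import os
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))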
CrookedWalrus33 this is odd I tested the exact same code.
I suspect something with the environment maybe?
What's the python version / OS? Also, can you send a full pip freeze?
2022-07-17 07:59:40,339 - clearml.storage - ERROR - Failed uploading: Parameter validation failed: Invalid type for parameter ContentType, value: None, type: <class 'NoneType'>, valid types: <class 'str'>
Yes this is odd, it should add the content-type of the file (for example "application/x-tar"), but you are getting N...
LOL, okay, I'm not sure we can do something about that one.
You should probably increase the storage on your instance 🙂
Found it
GiganticTurtle0 you are 🧨 ! thank you for stumbling across this one as well.
Fix will be pushed later today 🙂
- Could you explain how I can reproduce the missing jupyter notebook (i.e. the ipykernel_launcher.py)
Hi EnviousStarfish54
I think this is what you are after
task.connect_configuration(my_dict_here, name='my_section_name')
BTW:
if you do task.connect(a_flat_dict, name='new section') you will have the key/value in a section name called "new section"
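A minimal sketch of the difference (my_params / my_config are just placeholder names):
from clearml import Task

task = Task.init(project_name='examples', task_name='config sections')

# flat key/value pairs show up as hyper-parameters under the "new section" section
my_params = {'lr': 0.001, 'batch_size': 32}
task.connect(my_params, name='new section')

# the whole dict is stored as a configuration object named "my_section_name"
my_config = {'model': {'layers': 4, 'dropout': 0.1}}
task.connect_configuration(my_config, name='my_section_name')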
RipeGoose2 you can put it before/after the Task.init; the idea is for you to set it before any of the real training starts.
As for not affecting anything:
Try adding the callback and just have it return None (which means the model logging step is skipped). Let me know if this one works.
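If it helps, a rough sketch of what I mean, assuming you are registering a WeightsFileHandler pre-callback (adjust to whatever callback hook you are actually using):
from clearml.binding.frameworks import WeightsFileHandler

def skip_model_logging(operation_type, model_info):
    # returning None skips logging this model checkpoint
    return None

WeightsFileHandler.add_pre_callback(skip_model_logging)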
RipeGoose2 That sounds familiar. Could you test with the latest RC?
pip install trains==0.16.4rc0
SoreDragonfly16 could you reproduce the issue?
What's your OS? trains versions?
I made a custom image for the VMSS nodes, which is based on Ubuntu and has multiple CUDA versions installed, as well as conda and docker pre-installed.
This is very cool, any reason for not using docker for the multiple CUDA versions?
And still a difference between A/B, one detecting the repo, the other does not?
Actually with
base-task-id
it uses the cached venv, thanks for this suggestion! Seems like this is equivalent to cloning via UI.
exactly !
But “cloning” via UI runs an exact copy of the code/config, not a variant.
You can override the commit/branch and get the latest ...
run exp, tweak code/configs in IDE, or tweak configs via CLI, have it re-run in the exact same venv (with no install overhead etc.)
So you can actually launch it remotely directly from the code:
...
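Leaving the elided snippet aside, a minimal sketch of launching remotely from code (the queue name is just an example):
from clearml import Task

task = Task.init(project_name='examples', task_name='remote run')
# stops local execution here and enqueues this exact code/config for an agent
task.execute_remotely(queue_name='default', exit_process=True)
# ... the actual training code below only runs on the remote agent ...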
I can read them programmatically using tensorboard and then log them using the clearml logger,
StaleButterfly40 this will be a great script to put somewhere (I'm sure you are not the only one with this problem). Maybe put it as a GitHub issue ? wdyt ?
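For reference, a rough sketch of that approach, assuming the standard tensorboard EventAccumulator (the log dir path is a placeholder):
from clearml import Task
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

task = Task.init(project_name='examples', task_name='import tb events')
logger = task.get_logger()

ea = EventAccumulator('/path/to/tb/logdir')  # placeholder path
ea.Reload()
for tag in ea.Tags().get('scalars', []):
    for event in ea.Scalars(tag):
        logger.report_scalar(title=tag, series=tag, value=event.value, iteration=event.step)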
Hmm I see, add this for example
extra_docker_shell_script: ["rm ~/.bashrc", "echo removed bashrc"]
basically
would allow blocking the machine from being scaled-in when
Oh this is what I was missing 🙂 That makes sense to me!
So what you are saying is that when the AWS autoscaler agent is launching a Task, inside the container you will set the "protection flag", and when the Task ends, you will unset the "protection flag".
Is that correct?
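If that is the flow, a rough sketch of what the flag itself could look like, assuming AWS scale-in protection via boto3 (instance id and group name are placeholders):
import boto3

asg = boto3.client('autoscaling')

def set_scale_in_protection(instance_id, group_name, protected):
    # protected=True blocks the autoscaling group from terminating this instance mid-Task
    asg.set_instance_protection(
        InstanceIds=[instance_id],
        AutoScalingGroupName=group_name,
        ProtectedFromScaleIn=protected,
    )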
Hi DeterminedToad86
I just verified on a clean sagemaker instance, everything should just work, see here: https://demoapp.demo.clear.ml/projects/0e919ea1cc5c499b99e1ab85004b6e97/experiments/887edef09d4549e88b829a34c87d4d5b/output/execution
Yes, if you have more than one file (either notebook or python script) then you must have a git repo, in order to run the task using the Agent.
how did you try to restart them?
Yes, but how did you restart the agent on the remote machine?
Hi @<1560798754280312832:profile|AntsyPenguin90>
The image itself is uploaded in a background process, flush just triggers the starting of the process.
Could it be that it is showing a few seconds after?
Hi TrickySheep9
Hmm I think you are correct, exit remotely will not work inside a jupyter notebook because it will not be able to close it.
I was just revising workflows that might be similar, wdyt?
https://clearml.slack.com/archives/CTK20V944/p1620506210463400?thread_ts=1614234125.066600&cid=CTK20V944
WickedGoat98 are you running the agent with --gpus ?
What’s interesting to me (as a ClearML newbie) is it’s clearly compiling that wheel using my host machine (MacOS).
Hmm kind of, and kind of not.
If you take a look at the Tasks created (regardless of how they are created: pipeline, manually, etc.), you have a list of python packages required by the code, as they are detected at runtime (i.e. when the code was first executed, on the development machine). When creating a Pipeline controller (runner), the pipeline Tasks are just lists, ...
looks like a great idea, I'll make sure to pass it along and that someone replies 🙂
If this is the case, why not have the stream process call the REST API, then move forward with the result? This way it scales out of the box. The main "conceptual" difference is that the REST API is used internally, and the upside is that the event-stream processing becomes part of the application layer, not tied to the compute cost of the model, wdyt?
Basically what I want is a
clearml-session
 but with a docker container running JupyterHub instead of JupyterLab.
I missed that 🙂
The idea of clearml-session
is to launch a container with jupyterlab (or vscode) on a remote machine, and connect the user's machine (i.e. the machine executing the clearml-session
CLI) directly into the container.
Replacing the jupyterlab with JupyterHub would be meaningless here, because the idea is that it spins an instance (contai...
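For context, the usual single-user flow is something like this (the docker image name is just an example, assuming the standard --docker / --queue options):
clearml-session --docker nvidia/cuda:11.6.2-runtime-ubuntu20.04 --queue default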
BroadSeaturtle49 agent RC is out with a fix:
pip3 install clearml-agent==1.5.0rc0
Let me know if it solved the issue
WickedGoat98 what's the clearml version you are using?
ReassuredTiger98 could you provide more information ? (versions, scenario. etc.)
Hmm that is odd. Let me take a look and ask the guys. Thank you for quickly testing the RC! I'm hoping a new RC with a fix will be there tomorrow, if we can quickly replicate