Not sure I follow, is there still an issue ?
You mean one machine with multiple clearml-agents ?
(worker is a unique ID of an agent, so you cannot have two agents with the exact same worker name)
Or do you mean two agents pulling from the same queue ? (that is supported)
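For example, two agents on the same machine can both pull from the same queue (the queue name and GPU split below are placeholders):
clearml-agent daemon --queue default --gpus 0 --detached
clearml-agent daemon --queue default --gpus 1 --detached
Each daemon registers under its own worker name, so they do not collide.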
clearml launches a subprocess
Correct, this subprocess is used for resource monitoring and sending logs in the background (i.e. metrics, console, etc.)
Where is the "training" part coming from? I'm assuming the training is your main code?
Follow-up: is this happening when running manually or when executed via the agent?
Yes it does. I'm assuming each job is launched using a multiprocessing.Pool (which translates into a subprocess). Let me see if I can reproduce this behavior.
Hi @<1663354518726774784:profile|CrookedSeal85>
I am trying to optimize storage on my ClearML file server when doing a lot of experiments.
This is not straightforward; you will need to get a list of all the events via
None
filter on the image events,
and then delete the URL you are getting via the StorageManager.
But to be honest, why not just direct it to S3 or something like that ?
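If you do want to go the cleanup route, here is a rough sketch of that flow; the endpoint name (events.get_task_events), the event type string, and the use of the internal StorageHelper are assumptions on my side and may differ between server/SDK versions:
from clearml.backend_api.session.client import APIClient
from clearml.storage.helper import StorageHelper

client = APIClient()
task_id = "<your task id>"  # placeholder

# list the debug-image events reported by the task (assumed endpoint / event type)
res = client.events.get_task_events(task=task_id, event_type="training_debug_image")
for ev in (getattr(res, "events", None) or []):
    url = ev.get("url") if isinstance(ev, dict) else getattr(ev, "url", None)
    if url:
        # remove the stored image from the file server / object storage
        helper = StorageHelper.get(url)
        if helper:
            helper.delete(url)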
btw:
If you need to access it, just bash into the running docker: docker exec -it <container_name> /bin/bash
Usually in the /tmp folder under a temp filename (it is generated automatically when spun up)
In case of the services, this will be inside the docker itself
if I want to run the experiment the first time without creating the template?
You mean without manually executing it once ?
Okay, I was able to reproduce. This will only happen if you are running from a daemon process (like in the case of a process pool); Python is sometimes very picky when it comes to multi-threading/processes. I'll check what we can do 🙂
Thanks SarcasticSparrow10 !
I'll reply on the GitHub issue later (for better visibility)
But my initial thoughts:
(1) I think this was suggested, and hopefully we will get to implementing it; I can definitely see the value. Meanwhile you can achieve some of the functionality with the experiment table and custom columns 🙂
(2) "Don't display the performance metric" -> isn't that important? what am I missing?
(3) Hmm you mean just extra columns?
(4) sounds like a bug
(5) is this a plotly issue?...
But I am considering just failing the task.
This will of course work, just raise an exception in the Task itself, and protect the call from the pipeline logic function with try/except
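A minimal sketch of that pattern with the decorator-based pipeline (project/pipeline names are placeholders):
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["value"])
def risky_step():
    # raising here marks the component Task as failed
    raise ValueError("something went wrong")

@PipelineDecorator.pipeline(name="demo pipeline", project="demo", version="1.0")
def pipeline_logic():
    try:
        value = risky_step()
    except Exception:
        # the pipeline logic survives the failed component and can fall back / continue
        value = None
    return value

if __name__ == "__main__":
    PipelineDecorator.run_locally()
    pipeline_logic()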
Regarding the second option, try to nullify the hash on the Component Task:
from clearml import Task

# running inside the Task component:
# nullify the job hash so no one can reuse (cache) this component's result
Task.current_task()._set_runtime_properties({"pipeline_job_hash": None})
According to you the VPN shouldn't be a problem right?
Correct, as long as all parties are on the same VPN it should work; all the connections are http, so it is basically trivial communication
It is http btw, I don't know why it logged https://
This is odd, could it be it automatically forwards to https?
I would try the certificate check thing first
I'm assuming the reason it fails is that the docker network is only available for the specific docker compose. This means when you spin another docker compose they do not share the same names. Just replace with the host name or IP and it should work. Notice this has nothing to do with clearml or serving; these are docker network configurations
I would just add git+
None to your requirements (either in the requirements.txt or even better as part of the pipeline/component where you also specify the repo to be used)
The agent will automatically push the credentials when it installs the repo as a wheel.
wdyt?
btw: you might also get away with adding -e .
into the requirements.txt (but you will need to test that one)
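If you go the pipeline/component route, a sketch of what that could look like (the repo URLs and the package name are placeholders):
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(
    # extra requirement pulled straight from git, instead of a PyPI index
    packages=["git+https://github.com/my-org/my-private-package.git"],
    # the repository this component's code lives in
    repo="https://github.com/my-org/my-project.git",
)
def my_step():
    import my_private_package  # hypothetical module provided by the private repo
    return my_private_package.do_something()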
ReassuredTiger98
How can I make clearml-agent use pre-installed version from the nvidia/pytorch
If the same version is required, the agent will not try to reinstall it (the new venv the agent creates inside the container inherits the preinstalled system packages)
Comes with PyTorch Version 1.12 based on a commit. I tried torch >= 1.11, torch == 1.12
If in your installed packages you have torch==1.12, the agent should not tr...
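To pin the requirement to the exact preinstalled version from code, a hedged example (assuming Task.add_requirements is called before Task.init and that "1.12" matches the container's torch):
from clearml import Task

# make the recorded requirement match the version preinstalled in the container,
# so the agent can reuse it instead of reinstalling
Task.add_requirements("torch", "1.12")
task = Task.init(project_name="examples", task_name="train with preinstalled torch")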
Hi, what is host?
The IP of the machine running the ClearML server
Hi SmarmyDolphin68
You have two options:
(1) Automatically upload the models while training: pass output_uri to Task.init. For example, output_uri=True will upload to the clearml-server, output_uri='s3://bucket/folder' will upload to S3, etc.
(2) Manually upload a model that you have locally: https://github.com/allegroai/clearml/blob/9ff52a8699266fec1cca486b239efa5ff1f681bc/examples/reporting/model_config.py#L37
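A short sketch of both options (the bucket path, project/task names and file name are placeholders):
from clearml import Task, OutputModel

# Option 1: auto-upload models created during training
task = Task.init(project_name="examples", task_name="train",
                 output_uri="s3://bucket/folder")  # or output_uri=True for the clearml-server

# Option 2: register and upload a local model file manually
OutputModel(task=task).update_weights(weights_filename="model.pt")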
The second problem that I am running into now, is that one of the dependencies in the package is actually hosted in a private repo.
Add your private repo to the extra index section in the clearml.conf:
None
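Something along these lines in clearml.conf (the index URL is a placeholder):
agent {
    package_manager {
        # extra PyPI index the agent will pass to pip for the private packages
        extra_index_url: ["https://my-private-pypi.example.com/simple"]
    }
}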
Hi GloriousPenguin2 , Sorry this is a bit confusing. Let me expand:
When converting into a plotly object (the default), you cannot really control the dimensions of the plot in the UI programmatically; you can however drag the separator and expand width / height.
If you pass report_matplotlib_figure the argument report_image=True, it will create a static image from the matplotlib figure (as rendered locally) and use that as the figure, this way you get exactly WYSIWYG, but the...
cannot schedule new futures after interpreter shutdown
This implies the process is shutting down.
Where are you uploading the model? What is the clearml version you are using? Can you check with the latest version (1.10)?
Hi @<1684735407637401600:profile|WonderfulJellyfish65>
BTW, the training script connects to apiserver via the internal IP address
That is a big issue, because as you noticed the links to data generated by the code will have the internal IP ...
You basically need every component to use the same address (url)
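Concretely, that usually means the api section of clearml.conf (on every client and agent machine) should point at the same externally reachable address, e.g. (the host name is a placeholder, ports are the server defaults):
api {
    web_server: http://my-public-host:8080
    api_server: http://my-public-host:8008
    files_server: http://my-public-host:8081
}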
Ok so I accidentally (probably with luck) noticed the max_connection: 2 in the azure.storage config.
NICE!!!! 🙂
But wait, where is that set?
None
Should we change the default or add a comment ?
NastyFox63 ask SuccessfulKoala55 tomorrow, I think there is a way to change the default settings even with the current version.
(I.e. increase the default 100 entries limit)
I came across it before but thought it's only relevant for credentials
We are working on improving the docs, hopefully it will get clearer 🙂
The agent is using Bash (but when you add a command line to the docker run, .bashrc is not executed, hence no conda in PATH)
Maybe add the full path to the conda executable:
docker_setup_bash_script=[
    "export PATH=/workspace/miniconda/bin:$PATH",
    "export LOCAL_PYTHON=/workspace/miniconda/bin/python3",
    "/workspace/miniconda/bin/conda activate /PATH_GOES_HERE"
]
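If that list goes into the docker_setup_bash_script argument of Task.set_base_docker (an assumption on my side, depends on your clearml version), a sketch could look like (image name and paths are placeholders):
from clearml import Task

task = Task.init(project_name="examples", task_name="conda inside docker")
task.set_base_docker(
    "nvcr.io/nvidia/pytorch:22.12-py3",  # the container the agent will run in
    docker_setup_bash_script=[
        "export PATH=/workspace/miniconda/bin:$PATH",
        "export LOCAL_PYTHON=/workspace/miniconda/bin/python3",
        "/workspace/miniconda/bin/conda activate /PATH_GOES_HERE",
    ],
)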
The trend step artifact is used to keep track of the time of the data, so we know the expected trend of the input data. For example, for the first data point (trend_step = 1) the trend value is 10; then if trend_step = 10 (the tenth data point) our regressor will predict the trend value for the selected trend_step. This method is still in research to make it more efficient so it doesn't need to upload an artifact on every request
Makes sense! I would suggest you add a GitHub issue with a feature request ...