Setting max_workers to 1 prevents the error (but, I assume, it may come at the cost of slower sequential uploads).
This seems like a question for GS storage; maybe we should open an issue there, since their backend does the rate limiting.
My main concern now is that this may happen within a pipeline, leading to unreliable data handling.
I'm assuming the pipeline code will have max_workers, but maybe we could have a configuration value so that we can set it across all workers, wdyt?
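For illustration, a minimal sketch, assuming the upload in question is a ClearML Dataset upload and that your clearml version exposes max_workers on Dataset.upload() (project/dataset names below are placeholders):
```python
from clearml import Dataset

# placeholders: adjust project/name and the source folder to your setup
ds = Dataset.create(dataset_project="my_project", dataset_name="my_dataset")
ds.add_files("./data")
# max_workers=1 forces sequential chunk uploads, which avoids the GCS rate-limit
# error at the cost of slower upload throughput
ds.upload(max_workers=1)
ds.finalize()
```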
If
...
Regarding the demo app, this is just a default server that allows you to start playing around with ClearML without needing to set up any of your own servers or sign up
That said, I would recommend signing up (totally free) on the community server:
https://app.community.clear.ml/
How can I get the loaded model in the Preprocess class in ClearML Serving?
ComfortableShark77
You mean your preprocess class needs a Python package, or is it your own module?
PreciousParrot26 I think this is really a matter of the CI process having very limited resources. Just to be clear, you are correct and the steps themselves are Not executed inside the CI environment, but it seems that even running the pipeline logic is somehow "too much" for the limited resources... Make sense?
Hi, if you don't mind having a look too,
With pleasure :)
According to the above, I was expecting the config to be auto-magically updated with the new YAML config I edited in the UI; however, it seems like an additional step is required... probably connect_dict? Or am I missing something?
Notice the OmegaConf section description: "Full OmegaConf YAML configuration. This is a read-only section, unless 'Hydra/_allow_omegaconf_edit_' is set to True"
By default it will alw...
HappyDove3 where are you running the code?
(the upload is done in the background, but it seems the python interpreter closed?!)
You can also wait for the upload:
`task.upload_artifact(name="my artifact", artifact_object=np.eye(3,3), wait_on_upload=True)`
Let me know if there is an issue 🙂
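For completeness, a self-contained version of the snippet above (project/task names are placeholders):
```python
import numpy as np
from clearml import Task

task = Task.init(project_name="examples", task_name="artifact upload")  # placeholder names
# wait_on_upload=True blocks until the artifact is fully uploaded, so the
# interpreter cannot exit before the background upload finishes
task.upload_artifact(name="my artifact", artifact_object=np.eye(3, 3), wait_on_upload=True)
task.close()
```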
BitingKangaroo95 can you post here the entire console output of clearml-session (including full command line) ?
You can change the CWD folder: if you put `.` in the working dir it will be the root of the git repo, but you can use any subfolder. Obviously you need to change the script path to match the folder, e.g. ./folder/script.py etc.
AntsySeagull45 kudos on sorting it out 🙂
Quick note: trains-agent will try to run the Python version specified by the original Task, i.e. if you were running python3.7 it will first look for python3.7, and if it is not there it will run the default python3. This allows a system with multiple Python versions to run exactly the Python version you had on your original machine. The fact that it was trying to run python2 is quite odd; one explanation I can think of is if the original e...
but I'd prefer to have a new instance deployed for each new experiment and that it also terminates when no new experiments are queued
I'm not objecting, just wondering about the rationale behind the decision 🙂
Back to the AWS autoscaler:
Basically if you have the services-agent running on your cluster, it will just run the aws-autoscaler for you 🙂
The idea of the services-agent is to run logic/monitoring Tasks such as the AWS autoscaler. Notice that services-mode means multiple jobs per...
Although it's still really weird how it was failing silently
Totally agree. I think the main issue was that the agent had the correct configuration, but the container/env the agent was spinning up was missing it;
I'll double check how come it did not print anything
I want to run only that sub-dag on all historical data in ad-hoc manner
But wouldn't that be covered by the caching mechanism?
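To illustrate the caching I mean, a minimal sketch assuming the sub-dag steps are defined with PipelineDecorator (step name and logic below are placeholders):
```python
from clearml import PipelineDecorator

@PipelineDecorator.component(cache=True, return_values=["processed_path"])
def preprocess(raw_path: str):
    # with cache=True, re-running the pipeline with unchanged code and the same
    # raw_path reuses the previous output instead of re-executing the step
    processed_path = raw_path + ".processed"  # placeholder for the real work
    return processed_path
```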
In any case, do you have any suggestion of how I could at least hack tqdm to make it behave? Thanks
I think I know what the issue is: it seems tqdm is using an escape sequence instead of a plain CR; this is the 1b 5b 41 (ESC [ A, i.e. cursor-up) sequence I see in the binary log.
Let me see if I can hack something for you to test 🙂
Hi CleanPigeon16
You need to pass the private repository docker credentials to the AWS instance; I would use the custom bash script option of the AWS autoscaler to create the docker credentials file.
Hi @<1523701260895653888:profile|QuaintJellyfish58>
Based on the docs
I think this should have worked. Are you running the actual task_scheduler
on your machine? On the services queue? What's the console output you see there?
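For reference, a rough sketch of what I'd expect the scheduler side to look like (the task id, queue names and schedule below are placeholders, not taken from your setup):
```python
from clearml.automation import TaskScheduler

scheduler = TaskScheduler()
scheduler.add_task(
    schedule_task_id="<template_task_id>",  # placeholder: the Task to clone and enqueue
    queue="default",                        # placeholder queue for the scheduled runs
    day=1,                                  # placeholder schedule; see add_task() docs for exact semantics
)
# the scheduler process itself should keep running, e.g. as a service on the "services" queue
scheduler.start_remotely(queue="services")
```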
@<1540142651142049792:profile|BurlyHorse22> do you mean the one refereed in the video ? (I think this is the raw data in kaggle)
Are you suggesting the default "ubuntu:18.04" is somehow contaminated ?
This is an official Ubuntu container (nothing to do with ClearML), this is Very Very odd...
Thanks NonchalantDeer14 !
BTW: how do you submit the multi-GPU job? Is it multi-GPU or multi-node?
Which one of those? the 3d ball dots or the 3d face mesh?
I found something btw, let me check...
Funny enough I’m running into a new issue now.
Sorry, my bad, I should have known 😉 yes, it probably should be packages=["clearml==1.1.6"]
BTW: do you have any imports inside the pipeline function itself? If you do not, then there is no need to pass "packages" at all, it will just add clearml
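For context, a hedged sketch of where packages fits, assuming the pipeline step is defined with PipelineDecorator (the pandas import is just an illustrative extra dependency):
```python
from clearml import PipelineDecorator

# a component that imports extra packages inside the function lists them explicitly
@PipelineDecorator.component(packages=["clearml==1.1.6", "pandas"])
def load_data():
    import pandas as pd
    return pd.DataFrame({"a": [1, 2, 3]})

# a pipeline/component function with no imports of its own needs no "packages"
# argument at all; clearml is added automatically
```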
ElegantCoyote26 can you browse to http://localhost:8080 on the machine that was running trains-init?
... grab the model artifacts for each, put them into the parent HPO model as its artifacts, and then go through and archive everything.
Nice. Wouldn't it make more sense to "store" a link to the "winning" experiment? So you know how to reproduce it, and the set of HPs that were chosen?
Not that the model is bad, but how would I know how to reproduce it, or retrain when I have more data, etc.
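For example, a hedged sketch of what I mean by keeping a link to the winning experiment (assuming the HPO run uses HyperParameterOptimizer; the function and artifact name are placeholders):
```python
from clearml import Task
from clearml.automation import HyperParameterOptimizer

def link_winner(an_optimizer: HyperParameterOptimizer, parent_task: Task) -> None:
    # get_top_experiments() returns the best child Tasks, ordered by the objective
    best = an_optimizer.get_top_experiments(top_k=1)
    if best:
        # store the winning experiment's id on the parent HPO task instead of
        # copying all of its artifacts over
        parent_task.upload_artifact(name="winning_experiment_id", artifact_object=best[0].id)
```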
Weird that this code is also uploading to 'Plots'. I replicated the same thing as my main script, but the main script is still uploading to Debug Samples.
SmarmyDolphin68 are you saying the same code behaves differently ?
Still, my problem is that calling pipe.start() crashes.
is supposed to kill the process:
2022-08-19 09:17:56,626 - clearml - WARNING - Terminating local execution process
This is what it writes before killing the local process.
` /opt/homebrew/anaconda3/envs/py39/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 16 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be ...
Hi FierceHamster54
Dataset download is already multi-threaded.
But yes, get_local_copy() is thread/process safe.
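A minimal sketch for reference (dataset project/name are placeholders):
```python
from clearml import Dataset

# safe to call from multiple workers; they share the same local cache
local_path = Dataset.get(
    dataset_project="my_project", dataset_name="my_dataset"  # placeholders
).get_local_copy()
print(local_path)
```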
I mean just add a toy tqdm loop (like the one below) somewhere right before starting the Lightning train function. I just want to verify that it works, or whether there is something in the specific setup happening in real time that changes it
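Something along these lines, just as a sanity check (nothing ClearML-specific about it):
```python
import time
from tqdm import tqdm

# trivial progress bar, only to see how the CR/escape sequences show up in the console log
for _ in tqdm(range(20), desc="sanity check"):
    time.sleep(0.1)
```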
Hi @<1610083503607648256:profile|DiminutiveToad80>
Yes, it does. They are also cached by default (on the machine with the agent)