Try to upload something to the file server ?
I'm not familiar with this one, I think you should be able to control it with:
CLEARML_AGENT__API__HTTP__RETRIES__BACKOFF_FACTOR
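For example (a sketch; the value here is just illustrative), you could export it in the shell that launches the agent, or set it in the process environment before the SDK/agent initializes its API client:
import os
# illustrative value; must be set before the API client is created
os.environ["CLEARML_AGENT__API__HTTP__RETRIES__BACKOFF_FACTOR"] = "2.0"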
TroubledHedgehog16 generally speaking you can expect about 10 API calls per minute if you have many reports, and about 3 per minute with low reporting. We just optimized the SDK so that when there are lots of consecutive reports they are batched better; I would recommend the latest RC
But my previous ques and other query are still not figured out.
What do you mean by "previous ques and other query" ?
HandsomeCrow5 So using the _edit method you have the ability to add/edit the execution.script field, without worrying about the API version (I guess the name edit is misleading, it also does add :)
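For illustration, a rough sketch of what that could look like (the exact field layout depends on your API/server version, so treat the names here as assumptions):
from clearml import Task
task = Task.get_task(task_id="<task-id>")  # placeholder id
# _edit is an internal helper; here we assume it accepts the script section
# as a keyword argument (field names may differ across API versions)
task._edit(script=dict(
    repository="https://github.com/me/my-repo.git",  # hypothetical repo
    entry_point="train.py",
    working_dir=".",
))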
How much free RAM / disk do you have there now? How's the CPU utilization? How many Tasks are running on this machine at the same time?
Create one experiment (I guess in the scheduler)
task = Task.init('test', 'one big experiment')
Then make sure the scheduler creates the "main" process as a subprocess (basically the default behavior).
Then the subprocess can call Task.init and it will get the scheduler Task (i.e. it will not create a new task). Just make sure they all call Task.init with the same task name and the same project name.
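Roughly, the idea is (a sketch; the project/task names just have to match):
from clearml import Task

# scheduler (main) process
task = Task.init('test', 'one big experiment')

# any subprocess spawned by the scheduler (the default behavior) that calls
# Task.init with the same project/task name gets the scheduler's Task back
# instead of creating a new one
task = Task.init('test', 'one big experiment')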
ShakyJellyfish91 what exactly are you passing to Task.create? Could it be you are only passing script= and leaving repo=None ?
it seems it's following the path of the script i'm using to task.create, eg:
The folder it should run in is the script path you are passing (i.e. "script=ep_fn," )
Wrong path would imply that is it not finding the correct repository, is that the case ?
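For reference, a hedged sketch of passing both repo= and script= to Task.create (the repo URL, branch and script path here are placeholders):
from clearml import Task

task = Task.create(
    project_name='examples',
    task_name='remote run',
    repo='https://github.com/me/my-repo.git',   # placeholder repository
    branch='main',
    script='path/inside/repo/ep_fn.py',         # path relative to the repo root
)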
Notice there is no need to upgrade the server, only the ClearML python package
Okay, so I think it doesn't find the correct Task, otherwise it wouldn't print the warning.
How do you set up the HPO class? Could you copy paste the code?
BattyLizard6 to my knowledge the main issue with fractional GPUs is that there is no real restriction on GPU memory allocation (with the exception of MIG slices, which are limited in other ways).
Basically one process/container can consume the maximum GPU RAM on the allocated card (this also includes the http://run.ai fractional solution, at least from what I understand).
This means that developer A can allocate memory so that developer B on the same GPU will start getting out-of-memory errors
(Notice in a...
Hi MistakenDragonfly51
Is it possible to use it without using the clearml agent system?
Yes it is, which would mean everything is executed locally
basically:
an_optimizer.start_locally()
instead of this line
https://github.com/allegroai/clearml/blob/51af6e833ddc5a8ba1efaaf75980f58616b25e85/examples/optimization/hyper-parameter-optimization/hyper_parameter_optimizer.py#L121
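i.e. in that example script, replace the start() call with something like (a sketch; an_optimizer and job_complete_callback are the objects defined in that file):
# instead of:
#   an_optimizer.start(job_complete_callback=job_complete_callback)
an_optimizer.start_locally(job_complete_callback=job_complete_callback)
an_optimizer.wait()   # block until the optimization finishes
an_optimizer.stop()   # make sure background resources are released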
This workflow however is the only way I have found to easily fix my previous "Module not found" errors
Hmm okay, makes sense.
Did you try to set these ?
or even hack the sys.path with something like:
import sys, os
sys.path.insert(0, os.path.abspath(os.path.dirname(__file__) + "/../"))
I think EmbarrassedSpider34 is correct.
When you pass the requirements to clearml-task, the agent will actually do the installation, depending on how it was configured (conda / pip).
That said, maybe it is worth adding support to provide the env.yml in the CLI ?
(Notice that adding specific channels needs to be configured on the agent, they are not stored per Task)
AlertCamel57 wdyt?
Now I am passing it the same way you have mentioned, but my code still gets stuck as in above screenshot.
The screenshot shows a warning from pyplot (matplotlib), not ClearML, or am I missing something ?
My guess is that it can't resolve credentials. It does not give me any pop up to login also
If it fails, you will get an error; there will never be a popup from code 🙂
... We need a more permanent place to store data
FYI you can store the "Dataset" itself on GS (instead of...
Notice: dataset_rgb.list_files() will list the content of the dataset, not the local files:
e.g.: /folder/myfile.ext
and not /home/user/cache/folder/myfile.ext
So basically I think you are just not passing actual files, you should probably do:
for local_file in Path(folder_rgb).rglob('*'): ...
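For context, a rough sketch of the difference (project/dataset names are placeholders):
from pathlib import Path
from clearml import Dataset

dataset_rgb = Dataset.get(dataset_project="examples", dataset_name="rgb")
# list_files() returns paths relative to the dataset root, e.g. "folder/myfile.ext"
print(dataset_rgb.list_files())

# to iterate over the actual files on disk, fetch a local copy first
folder_rgb = dataset_rgb.get_local_copy()
for local_file in Path(folder_rgb).rglob('*'):
    if local_file.is_file():
        print(local_file)  # e.g. /home/user/.clearml/cache/.../folder/myfile.ext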
IdealPanda97 hmmm interesting, what's exactly the scenario here?
It should be the last line (or almost) of the log. Is it there? Also it seems from the log that you are using trains 0.14.3; try with trains 0.15 and let me know if you are still missing packages
LazyTurkey38 I think this is caused by new versions of pip reporting the wrong link:
https://github.com/bwoodsend/pip/commit/f533671b0ca9689855b7bdda67f44108387fe2a9
Yes actually that might be it. Here is how it works:
It launches a thread in the background to do all the analysis of the repository, extracting all the packages.
If the process ends (for any reason), it will give the background thread 10 seconds to finish and then it will give up. If the repository is big, the analysis can take longer and it will quit
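If that is what's happening, one thing worth trying (a hedged suggestion, not a guaranteed fix) is to close the task explicitly at the end of the script, so the process doesn't exit before the background analysis is flushed:
from clearml import Task

task = Task.init(project_name='examples', task_name='short script')
# ... your code ...
task.close()  # flush and wait for background work before the process exits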
In the UI you can edit the base container image + add "SETUP SHELL SCRIPT", with any missing "apt update && apt-get install -y ..."
And you have the exact same folder structure / content, and server A/B give a different set of experiments ?
(is serverB empty, meaning no experiments at all?)
ConvolutedSealion94 try scikit, not scikitlearn
I think we should add a warning if a key is there and is being ignored... let me make sure of that
I just disabled all of them with auto_connect_frameworks=False
Yep that also works
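For reference, a quick sketch of both options (project/task names are placeholders; pick one of the two calls):
from clearml import Task

# disable all automatic framework logging
task = Task.init(project_name='examples', task_name='no auto logging',
                 auto_connect_frameworks=False)

# or keep auto-logging but switch off individual frameworks via a dict
task = Task.init(project_name='examples', task_name='partial auto logging',
                 auto_connect_frameworks={'matplotlib': False, 'joblib': False})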
ConvolutedSealion94 Let me try to explain how it works, I hope this will help in debugging.
There are two different entities here:
clearml-server: In this context the clearml server acts as a control-plane; it stores configuration on the different endpoints, models, preprocessing code etc. It does not perform any compute or serving.
clearml-serving-inference: https://github.com/allegroai/clearml-serving/blob/e09e6362147da84e042b3c615f167882a58b8ac7/docker/docker-compose-triton-gpu.yml#L77 . This ...
DilapidatedDucks58 long story short:
if you do:
from clearml import StorageManager
from clearml.storage.helper import StorageHelper
StorageHelper.get(" ", retries=5)
It should make sure that all the other s3:// links of this bucket will use the same original configuration (i.e. retries)
If this workaround works let's make sure we add it into the conf file, wdyt ?