HandsomeCrow5
So using the `_edit` method you have the ability to add/edit the `execution.script` field, without worrying about the API version (I guess the name `edit` is misleading, it also adds :)
How much free RAM / disk do you have there now? How's the CPU utilization? How many Tasks are working with this machine at the same time?
Create one experiment (I guess in the scheduler)
task = Task.init('test', 'one big experiment')
Then make sure the scheduler creates the "main" process as a subprocess (basically the default behavior)
Then the subprocess can call Task.init and it will get the scheduler Task (i.e. it will not create a new task). Just make sure they all call Task.init with the same task name and the same project name.
ShakyJellyfish91 what exactly are you passing to Task.create?
Could it be you are only passing `script=` and leaving `repo=None`?
it seems it's following the path of the script i'm using to task.create, eg:
The folder it should run in is the script path you are passing (i.e. "script=ep_fn,")
A wrong path would imply that it is not finding the correct repository, is that the case?
Notice there is no need to upgrade the server, only the ClearML python package
Okay, so I think it doesn't find the correct Task, otherwise it wouldn't print the warning.
How do you set up the HPO class? Could you copy paste the code?
BattyLizard6 to my knowledge the main issue with fractional GPU is that there is no real restriction on GPU memory allocation (with the exception of MIG slices, which are limited in other ways).
Basically one process/container can consume the maximum GPU RAM on the allocated card (this also includes the http://run.ai fractional solution, at least from what I understand).
This means that developer A can allocate memory so that developer B on the same GPU will start getting out-of-memory
(Notice in a...
Hi MistakenDragonfly51
Is it possible to use it without using the clearml agent system?
Yes it is, which would mean everything is executed locally
basically: `an_optimizer.start_locally()` instead of this line
https://github.com/allegroai/clearml/blob/51af6e833ddc5a8ba1efaaf75980f58616b25e85/examples/optimization/hyper-parameter-optimization/hyper_parameter_optimizer.py#L121
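A minimal sketch of that switch, assuming `an_optimizer` is an already-configured `HyperParameterOptimizer` like the one in the linked example. The `run_optimizer` helper and its `use_agents` flag are hypothetical names used only for illustration:

```python
def run_optimizer(an_optimizer, use_agents=False):
    """Run an already-configured optimizer, locally by default.

    `an_optimizer` is assumed to be a clearml HyperParameterOptimizer.
    This helper and the `use_agents` flag are hypothetical.
    """
    if use_agents:
        # Agent-based run: trial Tasks are enqueued for clearml-agent workers.
        an_optimizer.start()
    else:
        # Local run: trials are executed as subprocesses on this machine.
        an_optimizer.start_locally()
    # Block until the optimization is done, then shut the optimizer down.
    an_optimizer.wait()
    an_optimizer.stop()
```

Keeping both paths behind one flag makes it easy to debug locally first and move to agents later without touching the rest of the script.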
This workflow however is the only way I have found to easily fix my previous ‘Module not found’ errors
Hmm okay, makes sense.
Did you try to set these ?
or even hack the sys.path with something like:
` import sys, os
sys.path.insert(0, os.path.abspath(os.path.dirname(__file__) + "/../")) `
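The same hack, written as a small reusable function. This is only a sketch: `add_parent_to_sys_path` is a hypothetical helper name, and it assumes the packages you need live one folder above the script:

```python
import os
import sys


def add_parent_to_sys_path(script_file):
    """Prepend the parent of the script's folder to sys.path (hypothetical helper)."""
    parent = os.path.abspath(os.path.join(os.path.dirname(script_file), ".."))
    if parent not in sys.path:
        # Put it first so local packages win over installed ones with the same name.
        sys.path.insert(0, parent)
    return parent


# Typical use, at the very top of the script and before any local imports:
# add_parent_to_sys_path(__file__)
```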
I think EmbarrassedSpider34 is correct.
When you pass the requirements to clearml-task, it is actually the agent, depending on how it was configured (conda / pip), that does the installation.
That said, maybe it is worth adding support to provide the env.yml in the CLI ?
(Notice that adding specific channels needs to be configured on the agent, they are not stored per Task)
AlertCamel57 wdyt?
Now I am passing it the same way you have mentioned, but my code still gets stuck as in above screenshot.
The screenshot shows a warning from pyplot (matplotlib), not ClearML, or am I missing something?
My guess is that it can't resolve credentials. It does not give me any pop up to login also
If it fails, you will get an error, there will never be a popup from code 🙂
... We need a more permanent place to store data
FYI you can store the "Dataset" itself on GS (instead of...
Notice: `dataset_rgb.list_files()` will list the content of the dataset, not the local files,
e.g.: /folder/myfile.ext
and not /home/user/cache/folder/myfile.ext
So basically I think you are just not passing actual files, you should probably do: `for local_file in Path(folder_rgb).rglob('*'): ...`
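To make the difference concrete, here is a small sketch that walks a local folder and pairs each real file with the relative path it would have inside the folder. The function name is hypothetical, and the exact form in which a dataset stores its paths is an assumption here:

```python
from pathlib import Path


def local_files_with_relative_paths(folder):
    """Map each local file under `folder` to its path relative to `folder`."""
    root = Path(folder)
    mapping = {}
    for local_file in sorted(root.rglob("*")):
        if local_file.is_file():  # rglob also yields directories, skip them
            mapping[str(local_file)] = local_file.relative_to(root).as_posix()
    return mapping
```

The keys are the actual files on disk (what you would open or pass on), the values are the dataset-style relative entries.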
IdealPanda97 hmmm interesting, what's exactly the scenario here?
It should be the last line (or almost) of the log. Is it there? Also it seems from the log that you are using trains 0.14.3; try with trains 0.15 and let me know if you are still missing packages
LazyTurkey38 I think this is caused by new versions of pip reporting the wrong link:
https://github.com/bwoodsend/pip/commit/f533671b0ca9689855b7bdda67f44108387fe2a9
Yes actually that might be it. Here is how it works,
It launches a thread in the background to do all the analysis of the repository, extracting all the packages.
If the process ends (for any reason), it will give the background thread 10 seconds to finish and then it will give up. If the repository is big, the analysis can take longer than that, and it will quit before the package list is complete
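The pattern described above (background analysis plus a bounded wait at exit) can be sketched like this. It is an illustration of the idea only, not the actual ClearML implementation; `analyze_with_grace_period` and its parameters are hypothetical names:

```python
import threading
import time


def analyze_with_grace_period(analyze, grace_seconds=10.0):
    """Run `analyze` in a background thread, waiting at most `grace_seconds` for it."""
    result = {}

    def _worker():
        result["packages"] = analyze()

    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()
    # ... the main process would do its real work here ...
    thread.join(timeout=grace_seconds)  # bounded wait when the process is ending
    if thread.is_alive():
        return None  # gave up: the analysis needed more than the grace period
    return result.get("packages")


# A fast analysis finishes in time, a slow one is abandoned.
fast = analyze_with_grace_period(lambda: ["numpy"], grace_seconds=1.0)
slow = analyze_with_grace_period(lambda: time.sleep(0.5) or ["numpy"], grace_seconds=0.05)
```

In this sketch a very short script on top of a very large repository ends up in the `slow` case, which matches the missing-packages symptom.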
In the UI you can edit the base container image + add "SETUP SHELL SCRIPT", with any missing "apt update && apt-get install -y ..."
And you have the exact same folder structure / content, and server A/B give a different set of experiments ?
(is serverB empty, meaning no experiments at all?)
ConvolutedSealion94 try `scikit` not `scikitlearn`
I think we should add a warning if a key is there and is being ignored... let me make sure of that
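Until such a warning exists, a check along these lines can be done on the user side. This is a sketch under assumptions: the helper is hypothetical (not part of clearml), and `KNOWN_FRAMEWORK_KEYS` is a deliberately partial list, so check the `Task.init` documentation of your installed version for the full set:

```python
import warnings

# Assumption: partial list of accepted keys, extend it from the Task.init docs.
KNOWN_FRAMEWORK_KEYS = {
    "matplotlib", "tensorflow", "tensorboard", "pytorch", "xgboost", "scikit",
}


def check_framework_keys(auto_connect_frameworks):
    """Warn about keys missing from the known list and return them (hypothetical helper)."""
    unknown = sorted(set(auto_connect_frameworks) - KNOWN_FRAMEWORK_KEYS)
    for key in unknown:
        warnings.warn(
            f"auto_connect_frameworks key {key!r} is not in the known list "
            "and may be silently ignored"
        )
    return unknown
```

For example, `check_framework_keys({"scikitlearn": False})` flags the misspelled key, while `{"scikit": False}` passes cleanly.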
I just disabled all of them with `auto_connect_frameworks=False`
Yep that also works
ConvolutedSealion94 Let me try to explain how it works, I hope this will help in debugging.
There are two different entities here
Clearml-server: In this context clearml server acts as a control-plane, it stores configuration on the different endpoints, models, preprocessing code etc. It does Not perform any compute or serving.
clearml-serving-inference https://github.com/allegroai/clearml-serving/blob/e09e6362147da84e042b3c615f167882a58b8ac7/docker/docker-compose-triton-gpu.yml#L77 . This ...
DilapidatedDucks58 long story short:
if you do:
` from clearml import StorageManager
from clearml.storage.helper import StorageHelper
StorageHelper.get(" ", retries=5) `
It should make sure that all the other s3:// links of this bucket will use the same original configuration (i.e. retries)
If this workaround works let's make sure we add it into the conf file, wdyt ?
Hi PungentLouse55 ,
Yes we have integration with hydra on the todo list since it was first released, we actually know the guy behind Hydra, he is awesome!
What are your thoughts on integration? We would love to get feedback and pointers (Hydra itself is quite capable, and we were waiting until we had multiple configuration support; with v0.16 it was added, so now it is actually possible)
ProudMosquito87 Just a few pointers on how we convert the TB histograms to awesome (but less accurate) 3D surfaces.
First I have to admit, I almost never use these histograms, maybe to detect a plateau or if something goes really wrong...
The 3D surface is basically grouping all the histograms and then bucketing them (I think the default is 50 buckets) so that you get a general feel of what's going on, not necessarily a detailed view. Bottom line, you are correct, the TB is the source of truth...
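A rough sketch of what such bucketing could look like, to show why detail is lost. It assumes one histogram per reported step, all with the same number of bins, and that consecutive histograms are averaged into at most 50 rows. That reading of "bucketing" is an assumption, and this is not the actual ClearML code:

```python
def bucket_histograms(histograms, max_buckets=50):
    """Reduce a list of equal-length histograms to at most `max_buckets` rows."""
    n = len(histograms)
    if n <= max_buckets:
        return [list(h) for h in histograms]  # nothing to reduce
    bucketed = []
    for b in range(max_buckets):
        # Consecutive slice of histograms that falls into bucket b.
        start = b * n // max_buckets
        end = (b + 1) * n // max_buckets
        group = histograms[start:end]
        # Average the group bin by bin into one representative row.
        bucketed.append([sum(col) / len(group) for col in zip(*group)])
    return bucketed
```

With 1000 reported steps, each row of the surface would stand for 20 original histograms, so short-lived spikes get smoothed away.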
PompousBeetle71, these are CUDA versions, I'm looking for the NVIDIA driver version, for example 440.xx or 418.xx.
The reason is, we set an OS environment variable for the driver, and I remember that old drivers did not support it. Basically they do not support `NVIDIA_VISIBLE_DEVICES=all`, so I'm trying to see if that's the case; then we could add a fix.
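For reference, setting that variable for a child process looks roughly like this. It is a sketch only: `build_gpu_env` is a hypothetical helper, and the way the agent actually passes the variable may differ:

```python
import os


def build_gpu_env(gpus="all", base_env=None):
    """Return a copy of the environment with NVIDIA_VISIBLE_DEVICES set (hypothetical)."""
    env = dict(os.environ if base_env is None else base_env)
    # "all" exposes every GPU; a comma-separated list such as "0,1" selects some.
    env["NVIDIA_VISIBLE_DEVICES"] = gpus
    return env
```

The returned dict can be passed as `env=` to `subprocess.run` when launching the workload.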