Hi FierceHamster54
Do I need to instantiate a task inside my component ? Seems a bit redundant....
Yes, the idea is that the Task (along with the code) will be automatically linked with the output model, for better traceability.
That said, you can "import" a model into the system (i.e. it was created somewhere else and you want to register it) with InputModel.import_model:
https://clear.ml/docs/latest/docs/clearml_sdk/model_sdk#importing-models
I guess "Input" from that perspecti...
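For reference, a minimal sketch of the InputModel.import_model call mentioned above (the URL and name are placeholders):
from clearml import InputModel

# register a model that was created outside ClearML so it can be tracked and reused
model = InputModel.import_model(
    weights_url='s3://my-bucket/models/model.pkl',  # placeholder, point at your actual weights file
    name='my imported model',
)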
No, Task.create is for creating an external Task, not for logging your own process.
That said, you can probably override the git repo info with env vars:
@<1556812486840160256:profile|SuccessfulRaven86> is the issue with Flask reproducible? If so, could you open a GitHub issue so we do not forget to look into it?
Okay, so I can't figure out why it would "kill" the new experiments; I mean, it should run them. Is there any "smart stopping" that causes it to kill the process before it ends?
BTW: can this be reproduced with the clearml hydra example ?
I cannot modify an autoscaler currently running
Yes this is a known limitation, and I know they are working on fixing it for the next version
We basically have flask commands allowing to trigger specific behaviors. ...
Oh I see now, I suspect the issue is that the flask command is not executed from within the git project?!
try these values:
import os
from clearml import Task

# set these before Task.init so the repository info is picked up; replace the placeholders with your actual values
os.environ.update({
    'CLEARML_VCS_COMMIT_ID': '<commit_id>',
    'CLEARML_VCS_BRANCH': 'origin/master',
    'CLEARML_VCS_DIFF': '',
    'CLEARML_VCS_STATUS': '',
    'CLEARML_VCS_ROOT': '.',
    'CLEARML_VCS_REPO_URL': '<repo_url>',
})
task = Task.init(...)
Hi @<1556450111259676672:profile|PlainSeaurchin97>
You mean instead of the parallel coordinates ?
Long story short, this is done internally when you call Task.init (I think; there is a chance it is called before).
One way of controlling it would be to have something like:
Task.init(auto_connect_frameworks={'hydra': {'log_before_resolve': True}})
That said, I think it will be simpler to store both (in different section of course)
Maybe "Configuration Object: OmegaConf" and "Configuration Object: OmegaConfDefinition" ?
Test it on your local setup (I would hate to push a broken fix)
Is that possible?
Hi @<1523701066867150848:profile|JitteryCoyote63>
Thank you for bringing it up! Can you verify with the latest clearml-agent 1.5.3rc2?
Interesting use case. Do you already have the connect_configuration call in the code, or do we need to somehow create it?
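For context, this is roughly how connect_configuration is used today (a minimal sketch; the project/task names and the dict contents are just placeholders):
from clearml import Task

task = Task.init(project_name='examples', task_name='config demo')

# connect an arbitrary configuration dict so it shows up in the UI (and can be overridden when executed by an agent)
config = {'lr': 0.001, 'batch_size': 32}
config = task.connect_configuration(config, name='my_config')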
ColossalDeer61 btw, it turns out the docker-compose services definition was misconfigured on GitHub 😞 I suggest you get the latest copy of it:
curl -o docker-compose.yml
The wheel you download from pip, for example this one, torch-1.11.0-cp38-cp38-manylinux1_x86_64.whl, is actually both CPU and CUDA 11.7.
So I suppose clearml-agent is not responsible, because it finds a wheel for torch 1.11.0 with cu117.
The thing is, the agent used to do all the heavy parsing because pytorch never actually had a pip-compatible artifactory.
But now they do, so the agent basically passes the parsing to pip and just adds the correct additional pytorch pip repo.
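Concretely, that additional pytorch pip repo is effectively an extra pip index URL; in clearml.conf it would look something like this (illustrative values, adjust the CUDA version to your setup):
agent {
    package_manager {
        # extra pip repository used when resolving torch wheels
        extra_index_url: ["https://download.pytorch.org/whl/cu117"]
    }
}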
It seems we need to switch back... wdyt?
I am not sure what switching back will solve; here the wheel should have been correct, it's just the architecture of the card that is incompatible.
So I tested the "old" code that did the parsing and matching, and it did resolve to the correct wheel (i.e. it found that there is no 117, only 115, and installed that one).
I think we should switch back, and have a configuration to control which mechanism the agent uses , wdyt?
How so? Installing a local package should work, what am I missing?
Hi @<1526371965655322624:profile|NuttyCamel41>
. I do that because I do not know how to get the pickle file into the docker container
What would the pickle file do?
and load the MinMaxScaler within the script, as the sklearn dependency is missing
What do you mean by that? Are you getting an error when loading your model?
Hi @<1566596960691949568:profile|UpsetWalrus59>
All correct, with the exception of "...or 1GB Metric": this is a limit, since metrics (and metadata) are always stored on the clearml-server, so they are metered. There is also an API call limit, basically anti-abuse, which of course resets every month, but if you are running tens of experiments at the same time you will hit this limit. Make sense?
if I run my own ClearML self-hosted server?
Then you have everything on your end; it will not communicate with the SaaS offering, meaning no limits whatsoever.
(That said some of the cloud auto-scaling and compute features are not part of the open source)
Is there no await/synchronize method to wait for task update?
Yes, but then we will have to relaunch it (not unthinkable), but I'm still looking for the immediate value of doing all that work, wdyt?
I'm already at 300MB of usage with just 15 tasks
Wow, what do you have there? I would try to download the console logs and see what size you are getting; this is the only thing that makes sense, wdyt?
BTW: to get the detailed size for scalars, maximize the plot (otherwise you are getting "subsampled" data)
I can definitely feel you!
(I think the implementation is not trivial: metrics data size is collected and stored as a cumulative value on the account, and going over it per Task is actually quite taxing for the backend. Maybe it should be an async request? Like "get me a list of the X largest Tasks"? How would the UI present it? FYI, keeping some sort of bookkeeping per task is not trivial either, hence the main issue.)
Yes, that means the nvidia drivers are present (as you mentioned the GPU seems to be detected).
Could you check you have libnvidia-ml.so.1 inside the container ?
For example in /usr/lib/nvidia-XYZ/
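A quick sanity check you could run from inside the container (just a sketch scanning the usual library locations):
import glob

# look for the NVIDIA management library used for GPU monitoring
hits = glob.glob('/usr/lib/**/libnvidia-ml.so.1', recursive=True)
print(hits or 'libnvidia-ml.so.1 not found under /usr/lib')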
Regarding the first direction, this was just pushed 🙂
https://github.com/allegroai/clearml/commit/597a7ed05e2376ec48604465cf5ebd752cebae9c
Regarding the opposite direction:
That is a good question, I really like the idea of just adding another section named Datasets
SucculentBeetle7 should we do that automatically?
Hi DepressedChimpanzee34
Why do you need to have the configuration added manually? Isn't the clearml.conf easier? If not, I think OS environment variables are easier, no? I ran the above code and everything worked with no exception/warning... What does the try/except solve exactly?
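For example, the credentials can come from OS environment variables instead of code or a clearml.conf (a minimal sketch; the values are placeholders):
import os
from clearml import Task

# standard ClearML environment variables, set before Task.init()
os.environ['CLEARML_API_HOST'] = 'https://api.clear.ml'
os.environ['CLEARML_API_ACCESS_KEY'] = '<access_key>'
os.environ['CLEARML_API_SECRET_KEY'] = '<secret_key>'

task = Task.init(project_name='examples', task_name='env config demo')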
trains-agent runs a container from that image, then clones ...
That is correct
I'd like the base_docker_image to not only be defined at runtime
I see. May I ask why not just build it once, push it into your artifactory, and then have trains-agent use it? (it will be much faster)
Hi @<1635088270469632000:profile|LividReindeer58>
You mean the clearml.conf?
You can do:
from clearml.config import config_obj
you should have the entire configuration file as an object (dict interface)
fyi: under the hood it uses pyHOCON
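Something along these lines (a sketch, assuming the dict-like get() interface mentioned above):
from clearml.config import config_obj

# read a value from the loaded clearml.conf using pyHOCON-style dotted keys
web_server = config_obj.get('api.web_server', None)
print(web_server)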
Hi @<1545216070686609408:profile|EnthusiasticCow4>
My biggest concern is what happens if the TaskScheduler instance is shutdown.
Good question. As a follow-up: what happens to the cron service machine if it fails?!
TaskScheduler instance is shutdown.
And yes, you are correct: if someone stops the TaskScheduler instance, it is the equivalent of stopping the cron service...
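For reference, a minimal TaskScheduler sketch (the task ID and queue name are placeholders); the process running it has to stay alive, exactly like a cron daemon:
from clearml.automation import TaskScheduler

scheduler = TaskScheduler()

# re-launch an existing task every day at 07:30 into the 'default' queue
scheduler.add_task(
    schedule_task_id='<task_id>',
    queue='default',
    hour=7,
    minute=30,
)

# blocking call - this process *is* the scheduler, so keep it running
scheduler.start()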
btw: we are working on moving some of the cron/triggers capabilities to the backend , it will not be as flexi...