It's relatively new and it's great: from the usage aspect it is exactly like user/pass, only the password is the PAT, which really makes life easier
okay so the error should have been:
trains_agent: ERROR: Connection Error: it seems api_server is misconfigured. Is this the TRAINS API server http://<IP>:8008 ?
Not https nor 8010 ?!
Hi @<1610083503607648256:profile|DiminutiveToad80>
do you have a full log? can you share the code you are trying to run?
I'm guessing some network issue, though I can't figure out why it cannot connect while curl seems to work
First let's verify the conf:
from clearml.config import config_obj
import json
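# dump the resolved "sdk" section, so we can see exactly what the SDK loaded from clearml.conf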
print(json.dumps(config_obj.get("sdk"), indent=2))
what are you getting?
Sorry @<1524922424720625664:profile|TartLeopard58> 😞 we probably missed it
clearml-session is still being developed 🙂
Which issue are you referring to ?
Hmm, this is a good question, I "think" the easiest is to mount the .ssh folder from the host to the container itself. Then also mount clearml.conf into the container with force_git_ssh_protocol: true
see here
https://github.com/allegroai/clearml-agent/blob/6c5087e425bcc9911c78751e2a6ae3e1c0640180/docs/clearml.conf#L25
btw: ssh credentials, even though they sound more secure, are usually less so (since they easily end up containing overly broad credentials and other access rights), just my 2 cents 🙂 I ...
Hi PompousParrot44
Let's stick with a single question per thread, it will make my life a lot easier 🙂
What do you mean by "and not in the terminal directly when executed manually through script"?
trains-agent is (usually) executed as a daemon, pulling jobs and executing them.
The other option is to use it to manually execute a single task.
What am I missing?
BTW: I suspect this is the main issue:
https://github.com/python-poetry/poetry/issues/2179
Hi SubstantialElk6
You are uploading an artifact; a good use case for a numpy artifact would be a feature table.
If you want to upload an image, use either report_media or report_image, or upload the PIL image as an artifact.
What do you think?
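For example, something along these lines (just a sketch, not your exact code; the project/task names and the image path are placeholders):
from PIL import Image
from clearml import Task

task = Task.init(project_name="examples", task_name="image reporting")  # placeholder names
img = Image.open("/path/to/image.png")  # placeholder path
# report the image as a debug sample (shows up in the task's debug samples in the UI)
task.get_logger().report_image(title="samples", series="input", iteration=0, image=img)
# or store the PIL image itself as an artifact
task.upload_artifact(name="input_image", artifact_object=img)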
In the agent, no, it pipes stdout/stderr of the container and logs everything 😞
to get a json or something like that?
There is an api to get all the console logs, is this what you are after?
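Something like this (rough sketch, if I recall the call correctly; the task id is a placeholder):
from clearml import Task

task = Task.get_task(task_id="<task-id>")  # placeholder task id
# returns the console (stdout/stderr) output captured for the task
for report in task.get_reported_console_output(number_of_reports=3):
    print(report)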
well.. having the demo server by default lowers the effort threshold for trying ClearML and getting convinced it can deliver what it promises, and maybe test some simple custom use cases. I
This was exactly what we thought when we set it up in the first place 🙂
(I can't imagine the cost is an issue, probably maintenance/upgrades ...)
There is still support for the demo server, you just need to set the env key:
CLEARML_NO_DEFAULT_SERVER=0 python ...
Hi RoughTiger69
but still get the semantics of knowing when an (external) file changed?
How would you know it changed?
This implies you have a way to verify the hash, which means you download the data, no?
Hi SubstantialElk6
Yes, you are correct, the glue only needs to change the yaml and it will work.
When you say "Dev end", what do you mean? I was thinking of adding additional glue for multi-node and just adding queues, for example add a 4-node queue and attach a glue to it, wdyt?
Regarding horovod: horovod spins its own nodes, so integration with k8s is not trivial (regardless of ClearML). That said, I know that they do have support for horovod in the Enterprise edition, but I'm not sure ...
Hi GrievingTurkey78
I think the main issue is the lack of support for jsonargparse, is that correct?
(vanilla PyTorch Lightning uses argparse, which seems to work out of the box)
BoredHedgehog47 you need to configure the clearml k8s glue to spin up pods (instead of statically allocating agents per pod), does that make sense?
you can run md5 on the file as stored in the remote storage (nfs or s3)
s3 is implementation specific (i.e. minio, weka, Wasabi etc. might not support it) and I'm actually not sure regarding nfs (I mean you can run it, but it actually means you are reading the data; that said, nfs by definition, I'm assuming, is relatively fast access)
wdyt?
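For reference, a chunked md5 over an nfs-mounted file is something like (minimal sketch, the path is a placeholder):
import hashlib

def file_md5(path, chunk_size=8 * 1024 * 1024):
    # read in chunks so we never hold the whole file in memory
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(file_md5("/mnt/nfs/dataset/file.bin"))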
and then?
The thing is, programmatically this is not easy to expose as an API, because in the end the "function" (i.e. the CLI) never returns, it connects over SSH and stays
But you can query the Task it creates: the project is known, the user is known, and it is of a special type/tag
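i.e. something along these lines (a sketch only; the project name, tag and status filter are assumptions, adjust them to whatever clearml-session actually sets):
from clearml import Task

# list the running interactive-session Tasks (assumed project name and tag)
sessions = Task.get_tasks(
    project_name="DevOps",
    task_filter={"tags": ["interactive"], "status": ["in_progress"]},
)
for t in sessions:
    print(t.id, t.name)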
Could you expand on the use case of #18? How would you use it? What problem would it be solving?
Looks like a great idea, I'll make sure to pass it along and that someone replies 🙂
So if I pass a function that pulls the most recent version of a Task, it'll grab the most recent version every time it's scheduled?
Basically your function will be called, that's it.
What I'm assuming is that you would want that function to find the latest Task (i.e. query and filter based on project/name/tag etc.), clone the selected Task and enqueue it,
is that correct?
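Something along these lines (a sketch only; the project, tag, queue and ordering field are assumptions):
from clearml import Task

def clone_and_enqueue_latest():
    # find the most recently updated Task matching the filter
    candidates = Task.get_tasks(
        project_name="my_project",
        task_filter={"tags": ["production"], "order_by": ["-last_update"]},
    )
    if not candidates:
        return
    latest = candidates[0]
    cloned = Task.clone(source_task=latest, name=latest.name + " (scheduled)")
    Task.enqueue(cloned, queue_name="default")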
No worries, you should probably change it to pipe.start(queue='queue')
not start it locally
Is it working when you are calling it with start_locally?
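For reference, the difference is roughly (sketch, the queue name is a placeholder):
from clearml import PipelineController

pipe = PipelineController(name="my pipeline", project="examples", version="1.0.0")
# ... add steps here ...
# run the pipeline logic on an agent listening on the "services" queue:
pipe.start(queue="services")
# vs. running the pipeline logic in the current process:
# pipe.start_locally()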
@<1671689437261598720:profile|FranticWhale40> I might have found something, let me see if I can reproduce it
Hmm, let me check, there is a chance the level is dropped when manually reporting (it might be reserved for internal critical reports). Regardless, I can't see any reason we couldn't allow controlling it.
Hi @<1658281093108862976:profile|EncouragingPenguin15>
Should work. I'm assuming multiple nodes are running agents? Or are you saying Ray spins up the jobs and ClearML logs them?
We could use our 8xA100 as 8 workers, for 8 single-gpu jobs running faster than on a single 1xV100 each.
@<1546665634195050496:profile|SolidGoose91> I think that in order to have the flexibility there you need the "dynamic" GPU allocation that is only part of the "enterprise" offering 😞
That said, why not allocate a single A100 machine? no?
Should I map the poetry cache volume to a location on the host?
Yes, this will solve it! (maybe we should do that automatically when using poetry as the package manager)
Could you maybe add a GitHub issue, so we do not forget?
Meanwhile you can add the mapping here:
https://github.com/allegroai/clearml-agent/blob/bd411a19843fbb1e063b131e830a4515233bdf04/docs/clearml.conf#L137
extra_docker_arguments: ["-v", "/mnt/cache/poetry:/root/poetry_cache_here"]
Also, how would one ensure immutability?
I guess this is the big question: assuming we "know" a file was changed, this invalidates all versions using it, which is exactly why the current implementation stores an immutable copy. Or are you suggesting a smarter "sync" function?