ShakyJellyfish91 what exactly are you passing to Task.create?
Could it be you are only passing script= and leaving repo=None?
Thanks ShakyJellyfish91! Please let me know what you come up with, I would love for us to fix this issue.
ClumsyElephant70
Could it be that the virtualenv package is not installed on the host machine?
(From the log it seems you are running in venv mode, is that correct?)
Just verifying the Pod does get allocated 2 gpus, correct ?
What do you have under the "script path" in the Task?
okay this seems like a broken pip install python3.6
Can you verify it fails on another folder? Maybe it's a permissions thing: for example, if you run in docker mode, the permissions will be root, as the docker is creating those folders.
WickedGoat98
The webUI will look like the demo server 🙂 https://demoapp.trains.allegro.ai/
2. curl http://server-ip:8008 should return something like: {"meta":{"id":"78a9dc77081348e2930d1f429fd7e092","trx":"78a9dc77081348e2930d1f429fd7e092","endpoint":{"name":"","requested_version":1.0,"actual_version":null},"result_code":400,"result_subcode":0,"result_msg":"Invalid request path /","error_stack":null},"data":{}}
3. curl http://server-ip:8080 should return something like:
` <!d...
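If you want to script the port 8008 check instead of eyeballing the curl output, here is a minimal sketch. It only tests whether a response body has the JSON "meta"/"data" envelope shown in the sample above; the function name looks_like_api_response is made up for illustration and is not part of any ClearML/Trains package.

```python
import json


def looks_like_api_response(body: str) -> bool:
    """Return True if `body` is JSON with the "meta"/"data" envelope.

    Hypothetical helper: it only inspects the shape of the text that
    curl printed, it does not contact the server itself.
    """
    try:
        payload = json.loads(body)
    except ValueError:
        # Not JSON at all (for example an HTML page from the web UI port).
        return False
    if not isinstance(payload, dict):
        return False
    meta = payload.get("meta")
    return isinstance(meta, dict) and "result_code" in meta and "data" in payload


# Trimmed-down version of the sample response quoted above.
sample = '{"meta": {"result_code": 400, "result_msg": "Invalid request path /"}, "data": {}}'
print(looks_like_api_response(sample))
```

Note that a result_code of 400 with "Invalid request path /" is the expected answer here: it means the API server is up and simply has no endpoint at "/".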
pip install clearml==1.0.6rc2
Did not work?!
ClumsyElephant70
Can you manually run the same command? ['python3.6', '-m', 'virtualenv', '/home/user/.clearml/venvs-builds/3.6']
Basically: python3.6 -m virtualenv /home/user/.clearml/venvs-builds/3.6
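A quick way to check whether a given interpreter can import virtualenv at all, before running the full command, could look like the sketch below. The helper name module_available is hypothetical; it just runs `<python> -c "import <module>"` and looks at the exit code.

```python
import subprocess
import sys


def module_available(python_bin: str, module: str) -> bool:
    """Return True if `python_bin -c "import <module>"` exits with code 0."""
    try:
        result = subprocess.run(
            [python_bin, "-c", f"import {module}"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # The interpreter itself could not be started (e.g. not installed).
        return False
    return result.returncode == 0


# Example for the case in this thread (run it on the agent's host machine):
#   module_available("python3.6", "virtualenv")
print(module_available(sys.executable, "json"))
```

If this returns False for python3.6 and virtualenv, the next thing to try would be installing virtualenv for that interpreter.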
Hi
The Squash operation copies all the data and is no longer linked to previous commits?
Yes, basically the idea is that if you have a data version that relies on many parents that need to be merged, the squash will create a merged copy and push it all as a single version, and then yes, the parent versions are no longer needed
I thought this operation is like git squash but it seems to me
yeah... we did not want to actually delete the parents because unlike git, the operation is done ...
Notice there is a scroll_id there; you might need to call the API multiple times until you scroll over all the events
could that be it?
Notice that you need to pass the returned scroll_id to the next call
scroll_id = response["scroll_id"]
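To make the scroll loop concrete, here is a rough sketch. Everything about the transport is an assumption: fetch_page is a hypothetical callable that you would implement with whatever API call you are already making, the "events" key is assumed to hold the page of results, and an empty page is assumed to mean there is nothing left to scroll.

```python
from typing import Callable, Dict, List, Optional


def fetch_all_events(fetch_page: Callable[[Optional[str]], Dict]) -> List[dict]:
    """Call `fetch_page` repeatedly, feeding each returned scroll_id back in.

    `fetch_page(scroll_id)` is a hypothetical wrapper around your API call.
    It receives None on the first call and must return a dict with an
    "events" list and a "scroll_id" (both key names are assumptions).
    """
    events: List[dict] = []
    scroll_id: Optional[str] = None
    while True:
        response = fetch_page(scroll_id)
        page = response.get("events") or []
        if not page:
            # Assumed end condition: an empty page means we scrolled past the end.
            break
        events.extend(page)
        # Pass the returned scroll_id to the next call.
        scroll_id = response["scroll_id"]
    return events
```

The important part is only the last line of the loop: the scroll_id from each response goes into the next request, otherwise you keep getting the first page.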
Seems like settings on the clearml-server disappeared (specifically default queue tag?!)
the first runs perfectly fine,
Just making sure, running in an agent?
the second crashes
Running inside the same container as the first one ?
DefeatedOstrich93 what do you mean by "I am wondering why do I need to create files before applying diff ?"
git diff will not list files unless they are added (until then they are marked as "untracked"); think temp files, logs, etc. Until you add a file to git, it will basically ignore that file. Make sense?
but is there any other way to get env vars / any value or secret from the host to the docker of a task?
If this is docker, passing -e/--env as an argument would do the same: -e VAR=somevalue
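As an illustration of the -e VAR=somevalue form, here is a small sketch that builds those flags from variables on the host. The helper name docker_env_flags is made up for this example, and where you put the resulting arguments (Task settings, agent configuration, a plain docker run) is up to your setup.

```python
import os
from typing import Iterable, List, Mapping, Optional


def docker_env_flags(
    names: Iterable[str], environ: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Build `-e NAME=value` docker arguments for the given variable names.

    Hypothetical helper: values are read from `environ` (defaults to the
    host's os.environ); names that are not set are silently skipped.
    """
    env = os.environ if environ is None else environ
    flags: List[str] = []
    for name in names:
        if name in env:
            flags += ["-e", f"{name}={env[name]}"]
    return flags


print(docker_env_flags(["VAR"], {"VAR": "somevalue"}))
```

Keep in mind that a secret passed this way ends up in plain text in the docker command line.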
in the docker-compose file. Still strange...
hmm yes it is... If you have an idea on what went wrong let me know, we would love to fix it
Oh I see, yes the "metrics" include both scalars / plots & console outputs,
I also think they are updated only once a day (or maybe twice a day?) so even if you delete them it will take time to update
(archive is not delete, you then need to go to the archived view and delete it from there)
It seems like you are correct, everything should just work. Are you still getting the error? What's the clearml agent version?
but this would be still part of the clearml.conf right?
You can pass it per Task, and you can also configure the agent to always pass it; add this env:
https://github.com/allegroai/clearml-agent/blob/5a080798cb4292e198948fbe16cba70136cb6bdf/docs/clearml.conf#L137
Sure SharpDove45,
from clearml import Model
model = Model('model_id_aabbcc')
model.system_tags += ['archived']
Hi ObedientTurkey46
Use --services-mode in the agent; it will run many Tasks on the same machine. This is usually associated with the services queue, but it can be run on any queue. This way you could have the same machine easily running those multiple "control" tasks.
wdyt?
SoggyBeetle95 maybe it makes sense to configure the agent with access-all credentials? Wdyt
Hi MortifiedCrow63
saw
, ...
By default ClearML will only log the exact local place where you stored the file, I assume this is it.
If you pass output_uri=True to the Task.init it will automatically upload the model to the files_server and then the model repository will point to the files_server (you can also have any object storage as model storage, e.g. output_uri=s3://bucket)
Notice yo...
it seems it's following the path of the script i'm using to task.create, eg:
The folder it should run in is the script path you are passing (i.e. "script=ep_fn,")
Wrong path would imply that it is not finding the correct repository, is that the case?
TenseOstrich47 it's based on a free "index", so the first index not in use will be captured, but if you remove agents, then the order will change (e.g. you take down worker #1, the next worker you spin will be #1 because it is not taken)
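To illustrate the "first index not in use" rule, here is a toy sketch. It is not the agent's actual code; first_free_index is a made-up name, and starting the count at 0 is an assumption.

```python
from typing import Iterable


def first_free_index(used: Iterable[int]) -> int:
    """Return the smallest non-negative index that is not in `used`."""
    taken = set(used)
    index = 0
    while index in taken:
        index += 1
    return index


# Workers 0, 1, 2 are running; worker #1 is taken down.
running = {0, 1, 2}
running.discard(1)
# The next worker spun up gets the freed slot, not a new number at the end.
print(first_free_index(running))
```

So the numbers identify slots, not the order in which workers were started.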
NonchalantDeer14
I think the issue is the way it spins the subprocess is not with fork but with Popen, so clearml is not "loaded" into the subprocess hence no logging.
The easiest fix is to call Task.current_task() inside the actual code (somewhere when it starts), it should trigger clearml.
ClumsyElephant70 the odd thing is the error here:
docker: Error response from daemon: manifest for nvidia/cuda:latest not found: manifest unknown: manifest unknown.
I would imagine it will be with "nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.04" but the error is saying "nvidia/cuda:latest"
How could that be ?
Also can you manually run the same command (i.e. docker run --gpus device=0 --rm -it nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.04 bash)?