Ohh, I see now: the force SSH did not replace the user in the SSH link (only if the original was http), right?
What's the exact error you are getting ?
(Maybe this is a permissions error on the cache folder. Which folders is it using? You can see them in the configuration as well.)
Hi @<1545216070686609408:profile|EnthusiasticCow4>
is there a way to get the date from the InputModel?
You should be able to with model._get_model_data()
But I think we should have it all exposed, wdyt?
Hi TenseOstrich47, what are the matplotlib version and clearml version you are using?
Hi SkinnyPanda43
Can you attach the full log?
The clearml-agent is installed before your requirements.txt, so at least in theory it should not collide
Hi CloudySwallow27
This error occurs randomly during training (in other words training does successfully start).
What's the clearml-agent version you are using, and the clearml version?
DrabSwan66
Did you set "docker_install_opencv_libs: true" in your clearml.conf on the host machine ?
https://github.com/allegroai/clearml-agent/blob/e416ab526ba9fe05daa977b34c9e46b50fb214a0/docs/clearml.conf#L150
Just making sure, you are running clearml-agent in docker mode, correct?
What's the container you are using ?
Okay, some progress, so what is the difference ?
Any chance the issue can be reproduced with a small toy code ?
Can you run the tqdm loop inside the code that exhibits the CR issue ? (maybe some initialization thing that is causing it to ignore the value?!)
Omg that's a lot of submodules!
It has nothing to do with what the task sees: if you are inside a git repo, it will have to be cloned on the remote machine. Let me check in the code, maybe there is a workaround
The remaining problem is that this way, they are visible in the ClearML web UI which is potentially unsafe / bad practice, see screenshot below.
Ohhh that makes sense now, thank you
Assuming these are one-time credentials for every agent, you can add these arguments in the "extra_docker_arguments" in clearml.conf
Then make sure they are also listed in: hide_docker_command_env_vars which should cover the console log as well
https://github.com/allegroai/clearml-agent/blob/26e6...
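To make the idea concrete, here is a rough sketch of what "hiding" an environment variable in a logged docker command amounts to. This is only an illustration, not the agent's actual implementation; `mask_docker_env`, `MASK`, and the variable names are hypothetical.

```python
from typing import Iterable, List

MASK = "********"  # hypothetical placeholder shown instead of the real value


def mask_docker_env(args: List[str], hidden_keys: Iterable[str]) -> List[str]:
    """Return a copy of docker args with selected `-e KEY=VALUE` values masked."""
    hidden = set(hidden_keys)
    masked = []
    expect_env = False  # True when the previous argument was -e / --env
    for arg in args:
        if expect_env and "=" in arg:
            key, _ = arg.split("=", 1)
            if key in hidden:
                arg = f"{key}={MASK}"
        masked.append(arg)
        expect_env = arg in ("-e", "--env")
    return masked


# Hypothetical docker arguments, as they might be listed in extra_docker_arguments
docker_args = ["-e", "MY_TOKEN=abc123", "-e", "MODE=dev", "--ipc=host"]
print(mask_docker_env(docker_args, hidden_keys=["MY_TOKEN"]))
```

The point is just that the value is still passed to the container, only the printed or stored command shows the masked form.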
Hi MiniatureShells8
The torch.save triggers the model creation.
If you are using the same filename, then the same model in the system will be used.
New filenames will create new models.
What exactly is your use case ?
but instead, they cannot be run if the files they produce, were not committed.
The thing with git is that if you have new files and you did not add them, they will not appear in the git diff, hence they will be missing when running from the agent. Does that sound like your case?
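A quick way to check for this locally is to look for untracked files before enqueuing. A minimal sketch, assuming you feed it the text output of `git status --porcelain` (untracked entries start with `??`); the helper name and the sample paths are made up for illustration.

```python
from typing import List


def untracked_files(porcelain_output: str) -> List[str]:
    """Extract untracked paths from `git status --porcelain` (v1) output."""
    files = []
    for line in porcelain_output.splitlines():
        # Untracked entries are reported as "?? <path>"
        if line.startswith("?? "):
            files.append(line[3:])
    return files


# Hypothetical output: one modified tracked file, two new files never added
sample = " M train.py\n?? data/new_config.yaml\n?? utils/helpers.py\n"
print(untracked_files(sample))
```

Anything this returns is invisible to the git diff, so it needs a `git add` (and usually a commit) before the agent can see it.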
Yes, that sounds like the issue, is the file actually there ?
Yes you can (though not on the open-source version)
PipelineController works with default image, but it incurs overhead 4-5 min
You can try to spin the "services" queue without docker support, if there is no need for containers it will accelerate the process.
Repository cloning failed: Command '['git', 'fetch', '--all', '--recurse-submodules']' returned non-zero exit status 1.
This error is about failing to clone the pipeline code repo, how is that connected to changing the container ?!
Can you provide the full log?
Hi CluelessElephant89
hey guys, I believe clearml-agent-services isn't necessary right?
Generally speaking, yes you are correct
Specifically, this is the "services" queue agent, running your pipeline logic, services etc.
But it is not a must to get the server to work, and you can also spin it on a different host
I managed to set up my (Windows) laptop as a worker and reproduce the issue.
Any insight on how we can reproduce the issue?
It runs into the above error when I clone the task or reset it.
from here:
AssertionError: ERROR: --resume checkpoint does not exist
I assume the "internal" code state changed, and now it is looking for a file that does not exist. How would your code state change? In other words, why would it be looking for the file only when cloning? Could it be that you put the state on the Task, then you clone it (i.e. clone the exact same dict), and now the newly cloned Task "thinks" it is resuming?!
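If that is what happens, one way to make the code tolerant is to only resume when the checkpoint is really there. A minimal sketch, assuming the parameters live in a plain dict with a "resume" key (the key name and helper are hypothetical, not taken from your code):

```python
import os
from typing import Optional


def resolve_resume(params: dict) -> Optional[str]:
    """Return the checkpoint path only if it exists on this machine, else None."""
    path = params.get("resume")
    if path and os.path.isfile(path):
        return path
    # A cloned Task may carry a stale path copied from the original run
    return None


# Hypothetical parameters as they might look on a freshly cloned Task
cloned_params = {"resume": "/tmp/does/not/exist/last.pt", "epochs": 10}
checkpoint = resolve_resume(cloned_params)
print("resuming from", checkpoint if checkpoint else "scratch")
```

With a guard like this the cloned run starts from scratch instead of failing on the assertion.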
Docker would recognise that image locally and just use it right? I won't need to update that image often anyway
Correct
Is there a way to connect to the task without initiating a new one without overriding the execution?
You can, but not with automagic, you can manually send metrics/logs...
Does that help? or do we need the automagic?
Hi @<1523719753099644928:profile|ImmenseMole52>
but tasks of this pipeline don't inherit docker and packages, why? I want to build or pull one docker and env for all pipeline steps only once, so how can I do it?
You have to specify the docker image for the pipeline Tasks. By default it will not assume it is the same as the pipeline controller, basically just pass:
pipe.add_function_step(
name="load_data",
function=load_data,
function_kwargs={"config": conf...
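To avoid repeating the image on every step, you could build the step arguments with a small helper. This is only a sketch: `step_kwargs` and the image name are hypothetical, and I am assuming `docker` is the argument name accepted by `add_function_step` in your clearml version, so please verify it there.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical image shared by all pipeline steps
SHARED_IMAGE = "my-team/base:latest"


def step_kwargs(
    name: str,
    function: Callable[..., Any],
    function_kwargs: Optional[Dict[str, Any]] = None,
    docker_image: str = SHARED_IMAGE,
) -> Dict[str, Any]:
    """Build keyword arguments for one pipeline step, always including the image."""
    return {
        "name": name,
        "function": function,
        "function_kwargs": function_kwargs or {},
        "docker": docker_image,  # assumed argument name, check your version
    }


def load_data(config):
    """Placeholder step function for the example."""
    return config


kwargs = step_kwargs("load_data", load_data, {"config": {"path": "data/"}})
print(kwargs["name"], kwargs["docker"])
# Usage with a real controller would then look like:
#   pipe.add_function_step(**kwargs)
```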
What do you have under the "installed packages" section? Also you can configure the agent to use poetry to restore the environment (instead of pip)
Oh, that makes sense, basically here is what you should do:
Task.init(... output_uri='
')
output_model.update_weights(register_uri=model_path)
It will automatically create a unique target folder / file under None to store your model
(btw: passing the register_uri basically says: "I already uploaded the model there, just store the link", i.e. it does not upload the model)
Hi WackyRabbit7
the services (or the agent running there) is spinning multiple Tasks (as opposed to regular agent where it is one task at a time).
how can I give this agent git access?
in the docker-compose you can configure the git credentials (user/pass or user/key it is the same).
https://github.com/allegroai/clearml-server/blob/d0e2313a24eb1248ebf0ddf31bf589de0d675562/docker/docker-compose.yml#L137
The import process actually creates a new Task every import, that said if you take a look here:
https://github.com/allegroai/trains/blob/10ec4d56fb4a1f933128b35d68c727189310aae8/trains/task.py#L1733
you can pass a pre-existing Task ID to "import_task" https://github.com/allegroai/trains/blob/10ec4d56fb4a1f933128b35d68c727189310aae8/trains/task.py#L1653
Ohh sorry you will also need to fix the
def _patched_task_function
The parameter order is important as the partial call relies on it.
My bad, no need for that
I think the real issue is that I am not able to specify a platform for the model,
there is no need to specify it, remove it from the config.pbtxt - the clearml-serving will automatically add the background
Hello guys, i have 4 workers (2 in default and 2 in service queue on same machine)
Hi @<1526734437587357696:profile|ShaggySquirrel23>
I think what happens is that one agent is deleting its cfg file when it is done, but at least in theory each one should have its own cfg
One last request can you try with the agent's latest RC version 1.5.3rc2 ?