Hi UnsightlySeagull42
How can I reproduce this behavior?
Are you getting all the console logs?
Is it only the Tensorboard that is missing?
😞 It's working as expected for me...
That said, I tested on Linux & pip.
Any specific requirements to test with? From the log I see this is conda on Windows; are you using the base conda env or a venv inside conda?
okay, just so I understand, this is what you have on your client that can connect with the server:
api {
    api_server:
    web_server:
    files_server:
    credentials {"access_key": "KEY", "secret_key": "SECRET"}
}
Hi ThickDove42,
Yes, but by the time you are able to access it, it will be in display form (plotly), which is not very convenient.
If this is something you need to re-use, I would argue that it is an artifact and should be stored as an artifact (then accessing it is transparent). Obviously you can both report it as a table and upload it as an artifact, no harm in that.
what do you think?
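For example, a minimal sketch of doing both (the pandas DataFrame and the project/task names here are just placeholders):
import pandas as pd
from clearml import Task

# placeholder project/task names, only for illustration
task = Task.init(project_name="examples", task_name="table demo")
df = pd.DataFrame({"epoch": [1, 2], "loss": [0.5, 0.3]})

# report as a table: shows up as a plotly table in the UI (display only)
task.get_logger().report_table(title="results", series="metrics", iteration=0, table_plot=df)

# also upload as an artifact, so it can be re-used programmatically later
task.upload_artifact(name="results_df", artifact_object=df)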
If you spin two agents on the same GPU, they are not aware of one another... So this is expected behavior...
Make sense?
You mean I can do Epoch001/ and Epoch002/ to split them into groups and get a 100 limit per group?
yes then the 100 limit is per "Epoch001" and another 100 limit for "Epoch002" etc. 🙂
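Something along these lines should do it, a minimal sketch (the project/task names and titles are just placeholders), where each title prefix becomes its own group:
from clearml import Task
import numpy as np

# placeholder project/task names, only for illustration
task = Task.init(project_name="examples", task_name="debug samples demo")
logger = task.get_logger()

img = np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8)
# each title prefix ("Epoch001", "Epoch002", ...) is treated as a separate group,
# so the per-group history limit applies to each one independently
logger.report_image(title="Epoch001/samples", series="sample", iteration=0, image=img)
logger.report_image(title="Epoch002/samples", series="sample", iteration=0, image=img)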
TRAINS_WORKER_NAME=first_agent trains-agent --gpus 0
and
TRAINS_WORKER_NAME=second_agent trains-agent --gpus 0
curl seems okay, but this is odd https://<IP>:8010
it should be http://<IP>:8008
Could you change and test?
(meaning change the trains.conf and run trains-agent list)
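i.e. the relevant part of trains.conf would look roughly like this (a sketch, keeping your IP placeholder):
api {
    # plain http on port 8008, not https://<IP>:8010
    api_server: http://<IP>:8008
}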
Specifically your error seems to be an issue with the nvidia Triton container upgrade
For .git-credentials remove the git_pass/git_user from the clearml.conf
If you want to use ssh you need to also add:
force_git_ssh_protocol: true
https://github.com/allegroai/clearml-agent/blob/a2db1f5ab5cbf178840da736afdc370cfff43f0f/docs/clearml.conf#L25
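i.e. the agent section of your clearml.conf would look roughly like this (a sketch, only the relevant keys shown):
agent {
    # leave git_user / git_pass unset so ~/.git-credentials (or ssh keys) are used
    # git_user: ""
    # git_pass: ""

    # force all git urls to ssh so the agent clones over ssh
    force_git_ssh_protocol: true
}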
you can also increase the limit here:
https://github.com/allegroai/clearml/blob/2e95881c76119964944eaa0289549617e8afeee9/docs/clearml.conf#L32
I think you are correct, this is odd, let me check ...
clearml will register conda packages that cannot be installed if clearml-agent is configured to use pip. So although it is nice that a complete package list is tracked, it makes it cumbersome to rerun the experiment.
Yes mixing conda & pip is not supported by clearml (or conda or pip for that matter)
Even python package version numbers might not exist on both.
We could add a flag not to update back the pip freeze, it's an easy feature to add. I'm just wondering about the exact use case
preinstalled in the environment (e.g. nvidia docker). These packages may not be available via pip, so the run will fail.
Okay that's the part that I'm missing, how come in the first run the packages existed and in the cloned Task they are missing? I'm assuming agents are configured basically the same (i.e. docker mode with the same network access). What did I miss here?
ReassuredTiger98 both are running with pip as package manager, I thought you mentioned conda as package manager, no?
agent.package_manager.type = pip
Also the failed execution is looking for "ruamel_yaml_conda" but it is nowhere to be found on the original one?! How is that possible?
(also could you make sure all posts regarding the same question are put in the thread of the first post to the channel?)
Hi JealousParrot68
You mean by artifact names ?
Hmm SuccessfulKoala55 what do you think?
I did not start with python -m, as a module. I'll try that
I do not think this is the issue.
It sounds like anything you do on your specific setup will end with the same error, which might point to a problem with the git/folder ?
@<1546303293918023680:profile|MiniatureRobin9>
, not the pipeline itself. And that's the last part I'm looking for.
Good point, any chance you want to PR this code snippet ?
def add_tags(self, tags):
    # type: (Union[Sequence[str], str]) -> None
    """
    Add Tags to this pipeline. Old tags are not deleted.
    When executing a Pipeline remotely (i.e. launching the pipeline from the UI/enqueuing it), this method has no effect.
    :param tags: A li...
Nice! I'll see if we can have better error handling for it, or solve it altogether 🙂
PompousBeetle71 could you check that the "output:destination" is the same for both experiments ?
Hmmm, are you running inside pycharm, or similar ?
Hi @<1546303293918023680:profile|MiniatureRobin9> could it be the pipeline logic is created via the clearml-task CLI? If this is the case, I think this is an edge case we should fix. Basically it creates a Task instead of a pipeline, which in essence only affects the UI. To solve it, just run the pipeline locally, notice that by default when you start it, it will actually stop the local run and relaunch itself on an agent.
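For example, a minimal sketch of defining and starting the pipeline logic from a local script (project/step names here are just placeholders):
from clearml import PipelineController

# placeholder names, only to show the pattern
pipe = PipelineController(name="my-pipeline", project="examples", version="1.0.0")
pipe.add_step(
    name="train",
    base_task_project="examples",
    base_task_name="training task",
)

# running the script locally: start() registers the pipeline and by default
# stops the local run and relaunches the pipeline logic on an agent
pipe.start()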
Also, could you open a GitHub issue so we add a flag for it?
Hi ReassuredTiger98
When clearml is running inside the docker, the installed packages shown in the WebUI get updated.
Yes, this is by design, so the agent can always reproduce the exact python environment.
(internally the original requirements are also stored, but not available in the UI).
What exactly is the use case here? Wouldn't it make sense to reproduce the entire working environment when you clone the executed Task?
Then when run a second time, the task will contain the requirements of the (conda-) environment from the first run.
What you see in the log under "Summary - installed python packages:" will be exactly what is updated on the Task. But it does not contain the "ruamel_yaml_conda" package, this is what I cannot get...
But I did find this part:
ERROR: conda 4.10.1 requires ruamel_yaml_conda>=0.11.14, which is not installed.
Which points to conda needing this package and then failing to i...
Hi ReassuredTiger98
Could you send the logs of both runs?
(I'm not sure this is a bug, or some misconfiguration , but the scenario should have worked...)
I think my main point is: k8s glue on AKS or GKE basically takes care of spinning new nodes, as the k8s service does that. The AWS autoscaler is kind of a replacement, make sense?