Can you fix it locally, just to verify?
I cannot test it at the moment, hence my question.
JuicyFox94 any chance you can blindly approve?
Merged, is it working for you now?
Yes! Thanks so much for the quick turnaround
My pleasure 🙂
BTW: did you see this (it seems like the same bug?!)
https://github.com/allegroai/clearml-helm-charts/blob/0871e7383130411694482468c228c987b0f47753/charts/clearml-agent/templates/agentk8sglue-configmap.yaml#L14
In order to work with SSH cloning, one has to manually install openssh-client into the Docker image, it looks like.
Correct, you have to have SSH inside the container so that git can use it.
You can always install it with the following setup inside your agent's clearml.conf:

extra_docker_shell_script: ["apt-get install -y openssh-client", ]
https://github.com/allegroai/clearml-agent/blob/73625bf00fc7b4506554c1df9abd393b49b2a8ed/docs/clearml.conf#L145
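For context, a minimal sketch of how that sits inside the agent section of clearml.conf (running apt-get update first is my assumption, so the package index exists):

agent {
    # shell commands executed inside the docker container before the experiment starts
    extra_docker_shell_script: ["apt-get update", "apt-get install -y openssh-client"]
}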
Hi MelancholyElk85
I have a strong déjà vu feeling. The credentials are OK. How do I solve this? And if you need the full log, how do I share it without sharing private information? I'm fed up with this shit
Is this coming from the agent?
I don't think so. It is solved by installing openssh-client in the Docker image, or by adding a deploy token to the cloning URL in the Web UI.
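For reference, a deploy-token clone URL usually looks something like this (host, project path, and token values are placeholders):

https://my-token-user:my-deploy-token@gitlab.example.com/group/project.git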
You can also have the token (token == password) configured as the default user/pass in your agent's clearml.conf
https://github.com/allegroai/clearml-agent/blob/73625bf00fc7b4506554c1df9abd393b49b2a8ed/docs/clearml.conf#L19
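As a sketch, that would sit in the agent section of clearml.conf (the values here are placeholders):

agent {
    # git credentials the agent uses when cloning
    git_user: "my-token-user"
    git_pass: "my-deploy-token"
}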
I execute the clearml-session with the --docker flag.
This is to control the Docker image the agent will spin up for you (think of the dev environment you want to work in, like the NVIDIA PyTorch container that already has everything you need)
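For example, something along these lines (the image tag is just an illustration):

clearml-session --docker nvcr.io/nvidia/pytorch:23.03-py3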
DilapidatedDucks58 I see ...
This might be more complicated than one would imagine. A simple solution might be to store a snapshot of the values every time we reach a new maximum; a quick hack might be to add it as text on one of the task's parameters or properties (which we can later add to the table as a custom column), see the sketch below.
wdyt?
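A rough sketch of that hack, assuming the check runs inside your own training loop (the property names and helper function are illustrative, not an existing API):

from clearml import Task

task = Task.current_task()
best_value = float("-inf")

def maybe_snapshot(value, snapshot_text):
    # store a snapshot as user properties whenever a new maximum is reached;
    # user properties can later be shown as a custom column in the experiment table
    global best_value
    if value > best_value:
        best_value = value
        task.set_user_properties(best_value=str(value), best_snapshot=snapshot_text)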
server-->agent is fast, but agent-->server is slow.
Then multiple connections will not help; this is the bottleneck of the upload speed of your machine, regardless of what the target is (file-server, S3, etc...)
Not sure I follow, you mean to launch it on the Kubernetes cluster from the ClearML UI?
(like the clearml-k8s-glue ?)
MysteriousBee56 that is so weird ... last one, I promise 🙂

docker run -t --rm nvidia/cuda:10.1-base-ubuntu18.04 bash -c "echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' > /etc/apt/apt.conf.d/docker-clean && apt-get update && apt-get install -y git python3-pip && python3 -m pip install trains-agent && echo \$(which python3) && echo \$(which trains-agent)"
ohh right, my bad:

docker run -t --rm nvidia/cuda:10.1-base-ubuntu18.04 bash -c "echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' > /etc/apt/apt.conf.d/docker-clean && apt-get update && apt-get install -y git python3-pip && pip install trains-agent && echo done"
Hi PompousBeetle71 , this actually fits with other feedback we received.
And for that reason it is already being worked on! 🙂
I have a few questions as we are designing the new interface.
I think our biggest question was: are projects like folders?
That is, can I have experiments in a project, but also sub-projects?
Or are parent projects a way to introduce hierarchy into the mess, meaning a project has either experiments in it or sub-projects, but not both?
(obviously in both cases...
Hi JitteryCoyote63
What do you have in agent.cuda_version?
(you can see it printed at the beginning of the log)
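If the detection is wrong, you should also be able to force it in clearml.conf; a one-line sketch with an example value:

agent.cuda_version = "10.1"  # override the auto-detected CUDA version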
@<1523701079223570432:profile|ReassuredOwl55> did you try adding ./path/to/package manually?
You can also do that from code:
from clearml import Task

# notice you need to call Task.add_requirements before Task.init
Task.add_requirements("./path/to/package")
task = Task.init(...)
WickedGoat98 this is awesome! Let me know how I could help 🙂
BTW: I checked regarding the plot comparison; this is a backend issue due to the size of the plot, and I was told a fix will be deployed in a day or two.
I use torch.save to store some very large model, so it hangs forever when it uploads the model. Is there some flag to show a progress bar?
I'm assuming the upload is an HTTP upload (e.g. to the default files server)?
If this is the case, the main issue is that we do not have callbacks on HTTP uploads to report progress (which I would love a PR for, but this is actually a "requests" issue)
I think we had a draft somewhere, but I'm not sure ...
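The usual workaround with requests is a file-like wrapper that reports how many bytes were read; a minimal sketch of the idea (class and callback names are assumptions, not an existing clearml API):

import requests

class ProgressReader(object):
    # wraps a file object and reports bytes read, so a streamed upload can show progress
    def __init__(self, fileobj, total_size, callback):
        self._f = fileobj
        self._total = total_size
        self._sent = 0
        self._cb = callback

    def read(self, size=-1):
        chunk = self._f.read(size)
        self._sent += len(chunk)
        self._cb(self._sent, self._total)  # e.g. print a percentage
        return chunk

# usage sketch: requests streams any object exposing read()
# with open("model.pt", "rb") as f:
#     requests.put(upload_url, data=ProgressReader(f, total_size, my_callback))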
This is part of a bigger process which takes quite some time and resources; I hope I can try this soon, if it will help get to the bottom of this.
No worries, if you have another handle on how/why/when we lose the current Task, please share 🙂
Is this reproducible with the hpo example here:
https://github.com/allegroai/clearml/tree/400c6ec103d9f2193694c54d7491bb1a74bbe8e8/examples/optimization/hyper-parameter-optimization
What's your clearml version? (And is it possible for you to verify with the latest version?)
Hi @<1536881167746207744:profile|EnormousGoose35>
Could we just share the entire project instead of the workspace?
You mean allow access to a project between workspaces ?
If the answer is yes, then unfortunately the SaaS version (app.clear.ml) does not really support this level of RBAC; it is part of the enterprise version, which assumes a large organization with the need for that kind of access limit.
What is the use case ? Why not just share the entire workspace ?
Hey WickedGoat98
I found the bug: it is due to the fact that the numpy data (passed to plotly) contains both datetime and NaN, and plotly.js does not like it. I'll make sure this is fixed; in the meantime you can just remove the first row (it contains the NaN):

df = pd.concat([tickerDf.Close, tickerDf_Change.Close_pcent], axis=1)
df = df[1:]
ReassuredTiger98 after 20 hours, was it done uploading?
What do you see in the Task resource monitoring? (Notice there is a network_tx_mbs metric that, according to this, should be 0.152)
task.wait_for_status()  # blocks until the task reaches a final state
task.reload()  # refresh the local object with the latest state from the server
task.artifacts["output"].get()
Hi ShakyJellyfish91
It seems clearml is using a single connection, which makes the download take a long time
Hmm, I found this one:
https://github.com/allegroai/clearml/blob/1cb5dbb276026644ae20fef63d58256cdc887818/clearml/storage/helper.py#L1763
Does max_connections=10 mean 10 concurrent connections?
I called task.wait_for_status() to make sure the task is done
This is the issue. I will make sure wait_for_status() calls reload at the end, so when the function returns you have the updated object
in clearml.conf we could have:

azure.storage {
    max_connections = 10
    # containers: [
    #     {
    #         account_name: "clearml"
    #         account_key: "secret"
    #         # container_name:
    #     }
    # ]
}
Then in AzureContainerConfigurations:

class AzureContainerConfigurations(object):
    def __init__(self, container_configs=None, max_connections=None):
        ...

    @classmethod
    def from_config(cls, configuration):
        ...
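A rough sketch of how from_config might pick up the new key (assuming the configuration object exposes a get(key, default) accessor; the existing container parsing is elided):

    @classmethod
    def from_config(cls, configuration):
        # fall back to None, i.e. keep the library's default connection count
        max_connections = configuration.get("max_connections", None)
        container_configs = ...  # existing parsing logic stays as-is
        return cls(container_configs=container_configs, max_connections=max_connections)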