RobustSnake79 I have not tested, but I suspect that currently all the reports will stay in TB and will not be passed automagically into ClearML
It seems like something you would actually want to do with TB (i.e. drill into the graphs etc.) no?
(the payload is not in the correct form, can that be a problem?)
It might, but I assume you will get a different error
RobustGoldfish9 I see.
So in theory spinning an experiment on an agent would be clone code -> build docker -> mount code -> execute code inside docker?
(no need for requirements etc.?)
What do you mean by a custom queue ?
In the queues page you have a plus button, this will just create a new queue
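If you prefer doing it programmatically instead of through the UI, something along these lines should also work (a sketch off the top of my head using the APIClient, so please double-check against your clearml version; the queue name is just an example):
` from clearml.backend_api.session.client import APIClient

client = APIClient()
# create a new queue; "my_new_queue" is just an example name
client.queues.create(name='my_new_queue') `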
Can you verify this example is not working for you?
https://github.com/allegroai/clearml/blob/master/examples/frameworks/hydra/hydra_example.py
Thanks SmallDeer34 !
Would you like us to? How about a footnote/acknowledgement?
How about a reference / footnote ?
` @misc{clearml,
  title = {ClearML - Your entire MLOps stack in one open-source tool},
  year = {2019},
  note = {Software available from },
  url = { },
  author = {allegro.ai},
} `
Yes, I think you are correct, verified on Firefox & Chrome. I'll make sure to pass it along.
Thanks SteadyFox10 !
Are you running a jupyter notebook inside vscode ?
This looks strange that only a single scalar is reported.
Thanks ScantChimpanzee51 !
Let me see what I can find, should be easy enough to fix now 🙂
` os.environ['TRAINS_PROC_MASTER_ID'] = args.trains_id `
it should be ` '1:' + args.trains_id ` , i.e.
` os.environ['TRAINS_PROC_MASTER_ID'] = '1:{}'.format(args.trains_id) `
Also ` str(randint(1, sys.maxsize)) `
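Putting it together, a minimal sketch (the random id follows the suggestion above; the rest is my assumption of how it fits):
` import os
import sys
from random import randint

# a random id is fine here, as suggested above
trains_id = str(randint(1, sys.maxsize))

# the value is expected in the form '1:<task id>'
os.environ['TRAINS_PROC_MASTER_ID'] = '1:{}'.format(trains_id) `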
it handles 2FA if my repo is on GitHub and my account needs 2FA to sign in
It does not 😞
Hi RoundMosquito25
Hmm I remember this is tricky ... What's the clearml version? also where is the line you had to hack ?
Hi RattySeagull0
I'm trying to execute trains-agent in docker mode with conda as package manager, is it supported?
It should. That said, we really do not recommend using conda as a package manager (it is a lot slower than pip, and it can create an environment that will be very hard to reproduce due to conda's internal "compatibility matrix", which might change from one conda version to another)
"trains_agent: ERROR: ERROR: package manager "conda" selected, but 'conda' executable...
ShallowCat10 Thank you for the kind words 🙂
so I'll be able to compare the two experiments over time. Is this possible?
You mean like match the loss based on "images seen" ?
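If that is the case, here is a minimal sketch of reporting the loss against "images seen" so the two runs share the same x-axis (the names and values are placeholders, not your code):
` from clearml import Task

task = Task.init(project_name='examples', task_name='compare by images seen')  # placeholder names
logger = task.get_logger()

# report the loss with the number of images seen so far as the iteration,
# so both experiments can be compared on the same axis
images_seen = 1000  # placeholder
loss = 0.37         # placeholder
logger.report_scalar(title='loss', series='train', value=loss, iteration=images_seen) `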
BeefyCow3 see this https://allegroai-trains.slack.com/archives/CTK20V944/p1593077204051100 :)
which part of the code?
the main script?!
but is not part of the package
is the repo itself a package ?
Is there a way to do this all elegantly?
Oh yes there is, this is how the TaskB code will look:
` from clearml import Task
import torch

task = Task.init(..., 'task b')
param = {'TaskA': 'TaskA ID HERE'}
task.connect(param)
taska_model = Task.get_task(param['TaskA']).models['output'][-1]
model = torch.load(taska_model.get_local_copy())
# ... train ...
torch.save(model, 'model_b') `I might have missed something there, but generally speaking this will let you:
Select TaskA as a parameter of the TaskB training process. It will automagically register Task A's...
might it be related to the docker socket not being mounted to the agent daemon running inside a docker container?
Oh yes, if the daemon is running inside a docker container then you need both --privileged and mounting of the docker socket to get it to work
CourageousLizard33 specifically section (4) is the issue (and it's related to any elastic docker, nothing specific to trains-server):
` echo "vm.max_map_count=262144" > /tmp/99-trains.conf
sudo mv /tmp/99-trains.conf /etc/sysctl.d/99-trains.conf
sudo sysctl -w vm.max_map_count=262144
sudo service docker restart `
Did you try the above, and are you still getting the same error?
CourageousLizard33 VM?! I thought we are talking fresh install on ubuntu 18.04?!
Is the Ubuntu in a VM? If so, I'm pretty sure 8GB will do, maybe less, but I haven't checked.
How much did you end up giving it?
CourageousLizard33 Are you using the docker-compose to setup the trains-server?
CourageousLizard33 so you have a Linux server running Ubuntu VM with Docker inside?
I would imagine that you could just run the docker on the host machine, no?
BTW, I think 8gb is a good recommendation for a VM it's reasonable enough to start with, I'll make sure we add it to the docs
Everything seems correct...
Let's try to set it manually.
create a file ~/trains.conf , then copy paste the credentials section from the UI, it should look something like:
` api {
    web_server: http://127.0.0.1:8080
    api_server: http://127.0.0.1:8008
    files_server: http://127.0.0.1:8081
    credentials {
        "access_key" = "access"
        "secret_key" = "secret"
    }
} `
Let's see if that works
This means that if something happens with the k8s node the pod runs on,
Actually if the pod crashed (the pod, not the Task) k8s should re-spin it, no?
I also experience that if a worker pod running a task is terminated, clearml does not fail/abort the task.
From the k8s perspective, if the Task ended (failed/completed) it always returns with exit code 0, i.e. success, because the agent was able to spin up the Task. We do not want Tasks with exceptions to litter the k8s with endless r...
My question is what happens if I launch in parallel multiple doit commands that create new Tasks.
Should work out of the box.
I would like to confirm that current_task ...
Correct.
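In other words (a minimal sketch, assuming each command is a separate process that calls Task.init; the names are placeholders):
` from clearml import Task

# each parallel process calls Task.init and gets its own Task
task = Task.init(project_name='examples', task_name='parallel run')  # placeholder names

# anywhere later in the same process, current_task() returns that process's Task
assert Task.current_task().id == task.id `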
Could you post what you see under "installed packages" in the UI ?
but DS in order for models to be uploaded,
you still have to set:
output_uri=True
in the
No, if you set the default_output_uri, there is no need to pass output_uri=True in the Task.init() 🙂
It is basically setting it for you, make sense ?
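To illustrate (a sketch; the bucket path and names are placeholders):
` from clearml import Task

# without sdk.development.default_output_uri configured in clearml.conf,
# you pass the destination explicitly so model checkpoints are uploaded:
task = Task.init(project_name='examples', task_name='train', output_uri='s3://my-bucket/models')

# with default_output_uri set in clearml.conf, plain Task.init() is enough
# and the checkpoints go to that default destination automatically:
# task = Task.init(project_name='examples', task_name='train') `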