We are here if you need further help 🙂
SmarmySeaurchin8 just so that I don't miss anything.
One machine, two trains-agents, each one connected to a different trains-server, correct?
From the trains-agent --help:
trains-agent --config-file /home/user/my_trains_server1.conf daemon
trains-agent --config-file /home/user/my_trains_server2.conf daemon
Check the examples on the GitHub page, I think this is what you are looking for 🙂
https://github.com/allegroai/trains-agent#running-the-trains-agent
Hi ItchyJellyfish73
This seems aligned with the scenario you are describing; it looks like the API server is overloaded with simultaneous connections.
Add an additional apiserver instance to the docker-compose and nginx as a load balancer:
https://github.com/allegroai/clearml-server/blob/09ab2af34cbf9a38f317e15d17454a2eb4c7efd0/docker/docker-compose.yml#L4
`
apiserver:
  command:
  - apiserver
  container_name: clearml-apiserver
  image: allegroai/clearml:latest
  restart: unless-sto...
Seems the apiserver is out of connections, this is odd...
SuccessfulKoala55 do you have an idea ?
This one should work:
` path = task.connect_configuration(path, name=name)
if task.running_locally():
    my_params = read_from_path(path)
    my_params = change_params(my_params)  # change some stuff
    # store back the change; my_params is assumed to be the content of the param file (text)
    task.set_configuration_object(name=name, config_text=my_params) `
Still I wonder if it is normal behavior that clearml exits the experiments with status "completed" and not with failure
Well, that depends on the process exit code; if for some reason (not sure why) the process exits with return code 0, it means everything was okay.
I assume the "Detected an exited process, so exiting main" message is an internal print of your code; I guess it just leaves the process with exit code 0.
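If you do want the run to show up as failed in that situation, one option (a sketch; the detection condition and project/task names are placeholders) is to mark the task explicitly or exit with a non-zero code:
`
import sys
from clearml import Task

task = Task.init(project_name="examples", task_name="exit status demo")  # placeholder names

worker_exited_unexpectedly = True  # stand-in for your own detection logic

if worker_exited_unexpectedly:
    # either mark the task as failed explicitly ...
    task.mark_failed(status_reason="Detected an exited process")
    # ... or simply exit with a non-zero return code so the run is not marked "completed"
    sys.exit(1)
`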
Woot woot
ChubbyLouse32 when you get it working please PR it, this is very very cool!
(I'll be happy to help 🙂)
Okay, found the issue. To disable SSL verification globally, add the following env variable:
CLEARML_API_HOST_VERIFY_CERT=0
(I will make sure we fix the actual issue with the config file)
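For example, the same can be set from Python before ClearML is initialized (a minimal sketch; project/task names are placeholders, you can also just export the variable in the shell or in the agent's environment):
`
import os

# disable SSL certificate verification for the ClearML API client,
# equivalent to exporting CLEARML_API_HOST_VERIFY_CERT=0 in the shell
os.environ["CLEARML_API_HOST_VERIFY_CERT"] = "0"

from clearml import Task

task = Task.init(project_name="examples", task_name="no cert verify")
`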
Hmm, let me check, there is a chance the level is dropped when manually reporting (it might be reserved for internal critical reports). Regardless, I can't see any reason not to allow controlling it.
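For reference, manual reporting with an explicit level looks roughly like this (a sketch; project/task names are placeholders):
`
import logging
from clearml import Task

task = Task.init(project_name="examples", task_name="log level check")
logger = task.get_logger()

# report console text with an explicit severity level
logger.report_text("this one as DEBUG", level=logging.DEBUG)
logger.report_text("and this one as WARNING", level=logging.WARNING)
`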
I'm not sure if this was solved, but I am encountering a similar issue.
Yep, it was solved (I think v1.7+)
With spawn and forkserver (which is used in the script above) ClearML is not able to automatically capture PyTorch scalars and artifacts.
The "trick" is to have Task.init before you spawn your code, then (since your code will not start from the same state), you should call Task.current_task(), which would basically make sure everything is...
Hi ClumsyElephant70
What's the clearml version you are using?
(The first error is a by-product of a Python process.Event created before a forkserver is created, some internal Python issue. I thought it was solved, let me take a look at the code you attached)
What sort of data would be stored in the venvs-build folder?
ClumsyElephant70 temporary (lifetime of the task execution) virtual environment, including the code etc. It is deleted and recreated for every new task launched (or restored from cache, if venvs_cache is enabled)
Hi, I would like to understand how I can set the pip cache location for my agent,
ClumsyElephant70 by default the pip cache (and all other cache folders) are mounted back into the host itself under ~/.clearml/
I'm assuming the idea is a shared cache; if this is the case, do:
docker_pip_cache = ~/my_shared_nfs/pip-cache
https://github.com/allegroai/clearml-agent/blob/e3e6a1dda81bee2dd20a64d09746568e415f1823/docs/clearml.conf#L139
TenseOstrich47 it's based on a free "index", so the first index not in use will be captured, but if you remove agents then the order will change (e.g. you take down worker #1, the next worker you spin will be #1 because it is not taken).
ClumsyElephant70
Can you manually run the same command? ['python3.6', '-m', 'virtualenv', '/home/user/.clearml/venvs-builds/3.6']
Basically: python3.6 -m virtualenv /home/user/.clearml/venvs-builds/3.6
I'm sorry, I mean if the queue name is not provided to the agent, the agent will look for the queue with the "default" tag. If you are specifying the queue name, there is no need to add the tag.
Is it working now?
okay that's good, that means the agent could run it.
Now it is a matter of matching the TF version with CUDA (and there is no easy solution for that). Basically I think that what you need is "nvidia/cuda:10.2-cudnn7-runtime-ubuntu16.04"
Seems like settings on the clearml-server disappeared (specifically default queue tag?!)
It seems like you are correct, everything should just work. Are you still getting the error? What's the clearml agent version?
in the docker-compose file. Still strange...
hmm yes it is... If you have an idea on what went wrong let me know, we would love to fix it
Is this still an issue? (if you provide a queue name, the default tag is not used, so no error should be printed)
is there a way for me to get a link to the task execution? I want to write a message to slack, containing the URL so collaborators can click and see the progress
WackyRabbit7 Nice!
Basically you can use this one: task.get_output_log_web_page()
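For example (a sketch; the webhook URL and project/task names are placeholders):
`
import requests
from clearml import Task

task = Task.init(project_name="examples", task_name="notify collaborators")
url = task.get_output_log_web_page()  # direct link to the task's page in the web UI

# hypothetical Slack incoming-webhook URL, replace with your own
SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"
requests.post(SLACK_WEBHOOK, json={"text": f"Experiment running, follow progress here: {url}"})
`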
PreciousParrot26 I think this is really a matter of the CI process having very limited resources. Just to be clear, you are correct and the steps themselves are not executed inside the CI environment, but it seems that even running the pipeline logic is somehow "too much" for the limited resources... Makes sense?
OSError: [Errno 28] No space left on device
Hi PreciousParrot26
I think this says it all 🙂 there is no more storage left to run all those subprocesses
btw:
I am curious about why a ThreadPool of 16 threads is gathered,
This is the maximum number of simultaneous jobs it will try to launch (it will launch more after the launching is done; notice this is only the launching, not the actual execution), but this is just a way to limit it.
controller_object.start_locally(). Only the PipelineController should be running locally, right?
Correct, do notice that if you are using the Pipeline decorator and calling run_locally(), the actual pipeline steps are also executed locally.
which of the two are you using (Tasks as steps, or functions as steps with decorator)?
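To illustrate the decorator case (a sketch; names and step logic are made up), calling PipelineDecorator.run_locally() before invoking the pipeline function runs the controller logic and every step in the local environment:
`
from clearml.automation.controller import PipelineDecorator


@PipelineDecorator.component(return_values=["doubled"])
def step_one(x):
    return x * 2


@PipelineDecorator.pipeline(name="demo pipeline", project="examples", version="0.1")
def my_pipeline():
    print(step_one(21))


if __name__ == "__main__":
    # run the pipeline logic *and* all the steps locally (instead of enqueuing them)
    PipelineDecorator.run_locally()
    my_pipeline()
`
With the PipelineController object, start_locally() runs only the controller logic locally and by default still enqueues the steps (if I remember correctly there is a run_pipeline_steps_locally argument to run both locally).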
Good point!
I'll make sure we do 🙂
When using the UI with regex to search for experiments, due to the greedy nature of the search, it consistently pops up the "ERROR Fetch Experiments failed" window when starting to use groups in regex (that is, parentheses of any kind).
Hmm, that is a good point (i.e. it would only actually search on Enter)
Could it be updated so that if an invalid regex pattern is given, it simply highlights the search bar in red (or similar) rather than stop us while writing the search pattern?
...