Mmmh, probably yes. I can't say for sure (because I don't remember precisely when I upgraded to 0.17), but it looks like that.
Could you please share the stacktrace?
This https://discuss.elastic.co/t/index-size-explodes-after-split/150692 seems to say that with the _split API such a situation happens and resolves itself after a couple of days; maybe it's the same case for me?
Thanks! I would like to use this opportunity to split the indices into multiple shards, as explained here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-split-index.html#indices-split-index
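For reference, here's roughly what the split looks like from the Python client (a sketch; the index names and shard count are placeholders, and the source index must be made read-only first):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The source index must be read-only before it can be split
es.indices.put_settings(index="events-v1", body={"index.blocks.write": True})

# The target shard count must be a multiple of the source's primary shard count
es.indices.split(
    index="events-v1",
    target="events-v1-split",
    body={"settings": {"index.number_of_shards": 4}},
)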
Ok, I got the following error when uploading the table as an artifact:
ValueError('Task object can only be updated if created or in_progress')
So the problem comes when I do my_task.output_uri = "s3://my-bucket": trains checks in the background whether it has access to this bucket, and it is not able to find/read the creds.
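For context, this is roughly the code path I'm talking about (project/task names and the bucket are placeholders):

import pandas as pd
from trains import Task  # `from clearml import Task` in newer releases

task = Task.init(project_name="examples", task_name="table-artifact")

# Setting output_uri is what triggers the background check that trains can
# reach the bucket with creds from ~/trains.conf or the environment
task.output_uri = "s3://my-bucket"

# Uploading the table; this is where I hit the
# "Task object can only be updated if created or in_progress" error
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
task.upload_artifact(name="my_table", artifact_object=df)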
The reindexing operation showed no error and copied everything.
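For reference, the reindex call was roughly this (index names are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Copy all documents from the old index into the new one; large indices may
# need a higher request_timeout or wait_for_completion=False
es.reindex(
    body={"source": {"index": "events-old"}, "dest": {"index": "events-new"}},
    wait_for_completion=True,
)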
Thanks! Unfortunately still not working, here is the log file:
What is this cleanup service? Where is it available?
Without the envs, I had this error: ValueError: Could not get access credentials for 's3://my-bucket', check configuration file ~/trains.conf
After using the envs, I got this error: ImportError: cannot import name 'IPV6_ADDRZ_RE' from 'urllib3.util.url'
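For anyone else hitting this, the credentials section I'd expect in ~/trains.conf looks roughly like this (keys and region are placeholders; double-check against your trains version):

sdk {
    aws {
        s3 {
            # used for any s3:// destination unless a bucket-specific
            # entry in the credentials list overrides it
            key: "AWS_KEY_PLACEHOLDER"
            secret: "AWS_SECRET_PLACEHOLDER"
            region: "us-east-1"
        }
    }
}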
PS: in the new env, I've set num_replicas: 0, so I'm only talking about primary shards…
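For reference, dropping replicas on an existing index from the Python client looks roughly like this (the index name is a placeholder, and the exact mechanism may differ in your setup):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# With number_of_replicas at 0, only primary shards consume disk space
es.indices.put_settings(index="events-v1-split", body={"index": {"number_of_replicas": 0}})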
Thanks for the clarification SuccessfulKoala55! A follow-up question:
I would like to install several packages (opencv, numpy, torch) into the system site-packages
so that they are available in each experiment (to reduce the setup time of the experiments). Installing them globally via
(BTW: it will work with elevated credentials, but probably not recommended)
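In case it helps, I believe the agent-side piece is the system_site_packages option, roughly like this in the agent's ~/trains.conf (verify the exact key against your agent version):

agent {
    package_manager {
        # create task virtualenvs with access to globally installed
        # packages (opencv, numpy, torch), so they aren't reinstalled per task
        system_site_packages: true
    }
}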
What does that mean? I'm not sure I understand.
So the controller task finished, and now only the second trains-agent services-mode process is showing up as registered. So this is definitely something linked to switching back to the main process.
Ok yes, I get it. This info is also available at the very beginning of the logs, where the agent logs the full docker run command; is this docker_cmd a shorter version of it?
Maybe there is a setting in docker to move the space used to a different location? I can simply increase the storage of the first disk, no problem with that.
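There is indeed a daemon-level setting for this: docker's data-root. A sketch of /etc/docker/daemon.json (the path is a placeholder; restart the docker daemon after changing it):

{
  "data-root": "/mnt/bigdisk/docker"
}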
Hi SoggyFrog26 , https://github.com/allegroai/clearml/blob/master/docs/datasets.md
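A minimal sketch of the API those docs describe (project/dataset names and paths are placeholders):

from clearml import Dataset

# create a dataset, add local files, and upload it
ds = Dataset.create(dataset_project="examples", dataset_name="my-dataset")
ds.add_files("/path/to/local/data")
ds.upload()
ds.finalize()

# later, from an experiment, fetch a local copy
local_path = Dataset.get(dataset_project="examples", dataset_name="my-dataset").get_local_copy()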
I followed https://github.com/NVIDIA/nvidia-docker/issues/1034#issuecomment-520282450 and now it seems to be setting up properly
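For anyone landing here later: as far as I remember, that comment boils down to installing the NVIDIA container toolkit and restarting docker (Ubuntu example; package names and the test image tag may differ):

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
# sanity check that a container can see the GPU
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi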
Which commit corresponds to the RC version? So far we tested with the latest commit on master (9a7850b23d2b0e1f2098ab051de58ce806143fff).
I killed both trains-agents and restarted one to have a clean start. This way it correctly spins up docker containers for services tasks. So the bug probably appears when an error occurs while setting up a task and the agent cannot go back to the main task. I would need to do some tests to validate that hypothesis, though.
But I see in the agent logs:
Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', ...
I made some progress TimelyPenguin76! Now the task runs, and I get this error from docker:
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
…causing it to unregister from the server (and thus not remain there).
Do you mean that the agent actively notifies the server that it is going down? Or does the server infer that the agent is down after a timeout?