Interesting idea! (I assume for reporting only, not configuration)
Yes, for reporting only. Also to understand which version the agent uses to decide which torch wheel to download
Regarding the cuda check with nvcc, I'm not saying this is a perfect solution, I just mentioned that this is how it is currently done.
I'm actually not sure if there is an easy way to get it from nvidia-smi interface, worth checking though ...
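One possible way to check this, as a minimal sketch: recent drivers print a "CUDA Version: X.Y" field in the header of the plain nvidia-smi output, and it can be parsed with a regular expression. The function names below are hypothetical, and this is not how the agent does it today. Note that this field reports the highest CUDA version the driver supports, not the toolkit version that nvcc would report, and older drivers may not print it at all.

```python
import re
import shutil
import subprocess

# Matches the "CUDA Version: 11.4" field in the nvidia-smi header.
_CUDA_RE = re.compile(r"CUDA Version:\s*(\d+)\.(\d+)")


def parse_cuda_version(smi_output):
    """Return (major, minor) parsed from nvidia-smi text, or None if absent."""
    match = _CUDA_RE.search(smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def detect_cuda_version():
    """Run nvidia-smi if it is on PATH and parse its header; None on any failure."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi"],
            capture_output=True,
            text=True,
            timeout=10,
            check=False,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return parse_cuda_version(result.stdout)
```

Calling detect_cuda_version() on a machine without nvidia-smi simply returns None, so it could serve as a fallback when nvcc is missing.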
Ok, but when nvcc is not ava...
There it is: https://github.com/allegroai/clearml/issues/493
AppetizingMouse58 After some thought, we decided to install 0.16 from scratch, with no data migration, because we believe this was an edge case not worth spending effort on. Thank you very much for your help there, much appreciated. You guys rock!
that would work for pytorch and clearml yes, but what about my local package?
SuccessfulKoala55 I was able to recreate the indices in the new ES cluster. I specified number_of_shards: 4 for the events-log-d1bd92a3b039400cbafc60a7a5b1e52b index. I then copied the documents from the old ES using the _reindex API. The index is 7.5Gb on one shard.
Now I see that this index on the new ES cluster is ~19.4Gb. The index is divided into the 4 shards, but each shard is between 4.7Gb and 5Gb!
I was expecting to have the same index size as in the previous e...
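For reference, the two requests described above can be sketched as plain JSON bodies. This is only an illustration of the documented Elasticsearch request shapes (PUT /<index> with settings, then POST /_reindex); the helper names and the remote host value are hypothetical, and whether the copy used a remote source is an assumption.

```python
def create_index_body(number_of_shards, number_of_replicas=0):
    """Body for PUT /<index>: fixes the primary shard count at creation time."""
    return {
        "settings": {
            "index": {
                "number_of_shards": number_of_shards,
                "number_of_replicas": number_of_replicas,
            }
        }
    }


def reindex_body(source_index, dest_index, remote_host=None):
    """Body for POST /_reindex; remote_host is only set when copying across clusters."""
    source = {"index": source_index}
    if remote_host is not None:
        # Assumption: the old cluster is reachable at this URL and allowed
        # by the new cluster's reindex remote whitelist setting.
        source["remote"] = {"host": remote_host}
    return {"source": source, "dest": {"index": dest_index}}


if __name__ == "__main__":
    import json

    index = "events-log-d1bd92a3b039400cbafc60a7a5b1e52b"
    print(json.dumps(create_index_body(4), indent=2))
    print(json.dumps(reindex_body(index, index, "http://old-es:9200"), indent=2))
```

The destination index has to be created with the desired shard count before reindexing, since the number of primary shards cannot be changed on an existing index without a split or shrink.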
sorry, the clearml-session. The error is the one I shared at the beginning of this thread
Sure! Here are the relevant parts:
` ...
Current configuration (clearml_agent v1.2.3, location: /tmp/.clearml_agent.3m6hdm1_.cfg):
...
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version = ==20.2.3
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 ...
Thanks for the help SuccessfulKoala55 , the problem was solved by updating the docker-compose file to the latest version in the repo: https://github.com/allegroai/clearml-server/blob/master/docker/docker-compose.yml
Make sure to do docker-compose down and then docker-compose up -d afterwards, and not docker-compose restart
Hi NonchalantHedgehong19 , thanks for the hint! What should be the content of the requirements file then? Can I specify my local package inside? How?
So if all artifacts are logged in the pipeline controller task, I need the last task to access all the artifacts from the pipeline task. I need to execute something like PipelineController.get_artifact() in the last step task
ok, what is your problem then?
What about the stacktrace of the error: Error: Can not start new instance, An error occurred (InvalidParameterValue) when calling the RunInstances operation: Invalid availability zone: [eu-west-2]?
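A note on that message: eu-west-2 is an AWS region name, while availability zones carry a trailing letter (for example eu-west-2a). A minimal sketch of a sanity check that could be run on the configured value before launching, with a hypothetical helper name and a pattern that only covers standard zone names (not Local Zones or Wavelength Zones):

```python
import re

# Standard AZ names look like "<region><letter>", e.g. "eu-west-2a".
_AZ_RE = re.compile(r"^([a-z]{2}(?:-[a-z]+)+-\d+)([a-z])$")


def split_availability_zone(value):
    """Return (region, zone_letter) for a standard AZ name, or None otherwise."""
    match = _AZ_RE.match(value)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    for candidate in ("eu-west-2", "eu-west-2a"):
        print(candidate, "->", split_availability_zone(candidate))
```

A None result for the configured value means it does not look like a zone name, which matches the InvalidParameterValue error above.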
Could you please share the stacktrace?
This thread https://discuss.elastic.co/t/index-size-explodes-after-split/150692 seems to say that with the _split API this situation happens and resolves itself after a couple of days, maybe it is the same case for me?
Thanks! I would like to use this opportunity to split the indices into multiple shards, as explained here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-split-index.html#indices-split-index
Ok, I got the following error when uploading the table as an artifact: ValueError('Task object can only be updated if created or in_progress')
So the problem comes when I do my_task.output_uri = "s3://my-bucket": trains checks in the background whether it has access to this bucket, and it is not able to find/read the creds
the reindexing operation showed no error and copied everything
Thanks! Unfortunately still not working, here is the log file:
What is this cleanup service? where is it available?
Without the envs, I had error: ValueError: Could not get access credentials for 's3://my-bucket', check configuration file ~/trains.conf
After using envs, I got error: ImportError: cannot import name 'IPV6_ADDRZ_RE' from 'urllib3.util.url'
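To rule out a missing variable on the credentials side, a quick check like the sketch below can be run in the same environment as the task. It assumes that "envs" here means the standard AWS variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY; the helper name is hypothetical and it says nothing about the urllib3 ImportError, which looks like a separate package version issue.

```python
import os

# Assumption: credentials are passed through the standard AWS variables.
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_env(environ=None):
    """Return the names of required AWS variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_aws_env()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("AWS credential variables are set")
```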
PS: in the new env, I've set num_replicas: 0, so I'm only talking about primary shards...
Thanks for the clarification SuccessfulKoala55 ! A follow-up question:
I would like to install several packages (opencv, numpy, torch) in the system-site-packages so that they are available in each experiment (to reduce setup time of the experiments). Installing them globally via
So the controller task finished and now only the second trains-agent services mode process is showing up as registered. So this is definitely something linked to switching back to the main process.
Maybe there is a setting in docker to move the space used to a different location? I can simply increase the storage of the first disk, no problem with that
Hi SoggyFrog26 , https://github.com/allegroai/clearml/blob/master/docs/datasets.md
Awesome! Thanks!
I followed https://github.com/NVIDIA/nvidia-docker/issues/1034#issuecomment-520282450 and now it seems to be setting up properly