I reindexed only the logs to a new index afterwards; I am now doing the same with the metrics, since they cannot be displayed in the UI because of their wrong dynamic mappings.
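For reference, the reindex call I'm running is roughly this (a sketch using Python requests; the index names are placeholders, and the new index is created beforehand with the corrected mappings):
```python
import requests

ES = "http://localhost:9200"  # assuming Elasticsearch is reachable locally

# copy everything from the badly-mapped index into the new one
# (index names are placeholders)
resp = requests.post(f"{ES}/_reindex", json={
    "source": {"index": "events-log-old"},
    "dest": {"index": "events-log-new"},
})
print(resp.json())
```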
Sure! Here are the relevant parts:
```
...
Current configuration (clearml_agent v1.2.3, location: /tmp/.clearml_agent.3m6hdm1_.cfg):
...
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version = ==20.2.3
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 ...
```
Thanks for the help SuccessfulKoala55, the problem was solved by updating the docker-compose file to the latest version in the repo: https://github.com/allegroai/clearml-server/blob/master/docker/docker-compose.yml
Make sure to run `docker-compose down && docker-compose up -d` afterwards, and not `docker-compose restart`.
Hi NonchalantHedgehong19, thanks for the hint! What should be the content of the requirements file then? Can I specify my local package inside? How?
So if all artifacts are logged in the pipeline controller task, I need the last task to access all the artifacts from the pipeline task. I need to execute something like `PipelineController.get_artifact()` in the last step task.
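Something like this is what I'm after (a sketch, assuming the controller is the parent of the step task; "table_data" is a hypothetical artifact name):
```python
from clearml import Task

# inside the last pipeline step: the controller is this task's parent
step_task = Task.current_task()
controller_task = Task.get_task(task_id=step_task.parent)

# read an artifact that an earlier step logged on the controller task
# ("table_data" is a hypothetical artifact name)
table = controller_task.artifacts["table_data"].get()
```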
ok, what is your problem then?
What about the stacktrace of the error `Error: Can not start new instance, An error occurred (InvalidParameterValue) when calling the RunInstances operation: Invalid availability zone: [eu-west-2]`?
Mmmh, probably yes, I can't say for sure (because I don't remember precisely when I upgraded to 0.17), but it looks like it.
Could you please share the stacktrace?
This https://discuss.elastic.co/t/index-size-explodes-after-split/150692 seems to say that with the `_split` API such a situation happens and resolves itself after a couple of days; maybe it's the same case for me?
Thanks! I would like to use this opportunity to split the indices into multiple shards, as explained here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-split-index.html#indices-split-index
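Roughly what I plan to run (a sketch with Python requests; index names and shard count are placeholders):
```python
import requests

ES = "http://localhost:9200"  # assuming Elasticsearch is reachable locally

# 1. the source index must be write-blocked before it can be split
requests.put(f"{ES}/my-index/_settings",
             json={"index.blocks.write": True})

# 2. split into a new index with more primary shards
#    (the target count must be a multiple of the source's shard count)
requests.post(f"{ES}/my-index/_split/my-index-split",
              json={"settings": {"index.number_of_shards": 2}})
```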
Ok, I got the following error when uploading the table as an artifact: `ValueError('Task object can only be updated if created or in_progress')`
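The upload itself is nothing special, roughly (a sketch; the artifact name and data are placeholders):
```python
import pandas as pd
from clearml import Task

task = Task.current_task()
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # placeholder table

# this is the call that raises the ValueError above once the task
# is no longer in a created/in_progress state
task.upload_artifact(name="table", artifact_object=df)
```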
So the problem comes when I do `my_task.output_uri = "s3://my-bucket"`: trains, in the background, checks if it has access to this bucket, and it is not able to find/read the creds.
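For reference, the credentials section in my ~/trains.conf looks roughly like this (bucket name and keys are placeholders):
```
sdk.aws.s3 {
    credentials: [
        {
            bucket: "my-bucket"
            key: "<access_key>"
            secret: "<secret_key>"
        }
    ]
}
```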
the reindexing operation showed no error and copied everything
Thanks! Unfortunately still not working, here is the log file:
What is this cleanup service? Where is it available?
Without the envs, I had the error `ValueError: Could not get access credentials for 's3://my-bucket', check configuration file ~/trains.conf`. After using the envs, I got the error `ImportError: cannot import name 'IPV6_ADDRZ_RE' from 'urllib3.util.url'`.
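By "the envs" I mean the standard AWS variables, roughly (values are placeholders):
```python
import os

# the standard AWS environment variables, read by boto3 under the hood
os.environ["AWS_ACCESS_KEY_ID"] = "<access_key>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret_key>"
os.environ["AWS_DEFAULT_REGION"] = "<region>"
```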
PS: in the new env, I've set `num_replicas: 0`, so I'm only talking about primary shards…
Thanks for the clarification SuccessfulKoala55 ! A follow-up question:
I would like to install several packages (opencv, numpy, torch) in the system site-packages, so that they are available in each experiment (to reduce the setup time of the experiments). Installing them globally via
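What I have in mind is flipping the agent setting shown in the config above (a sketch, assuming the packages are pre-installed in the agent's system Python):
```
agent.package_manager.system_site_packages = true
```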
(BTW: it will work with elevated credentials, but probably not recommended)
What does that mean? I'm not sure I understand.
So the controller task finished, and now only the second trains-agent services-mode process is showing up as registered. So this is definitely something linked to the switching back to the main process.
Ok yes, I get it: this info is also available at the very beginning of the logs, where the agent logs the full `docker run` command; this `docker_cmd` is a shorter version?
Maybe there is a setting in docker to move the space used to a different location? I can simply increase the storage of the first disk, no problem with that.
Hi SoggyFrog26 , https://github.com/allegroai/clearml/blob/master/docs/datasets.md
I followed https://github.com/NVIDIA/nvidia-docker/issues/1034#issuecomment-520282450 and now it seems to be setting up properly
Which commit corresponds to the RC version? So far we tested with the latest commit on master (9a7850b23d2b0e1f2098ab051de58ce806143fff).