I guess I can have a workaround by passing the pipeline controller task id to the last step, so that the last step can download all the artifacts from the controller task.
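A minimal sketch of that workaround, assuming the controller id is passed to the last step as a parameter (the names here are my own illustration, not the actual pipeline code):
```
from clearml import Task

def last_step(controller_task_id):
    # fetch the pipeline controller task by the id passed in as a parameter
    controller = Task.get_task(task_id=controller_task_id)
    # download local copies of every artifact registered on the controller
    return {
        name: artifact.get_local_copy()
        for name, artifact in controller.artifacts.items()
    }
```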
Hi AgitatedDove14 , Here is the full log.
Both python versions (local and remote) are Python 3.6. Locally (macOS), I get pytorch3d== (from versions: 0.0.1, 0.1.1, 0.2.0, 0.2.5, 0.3.0, 0.4.0, 0.5.0)
Remotely (Ubuntu), I get (from versions: 0.0.1, 0.1.1, 0.2.0, 0.2.5, 0.3.0)
So I guess it's not related to clearml-agent really, rather pip that cannot find the proper wheel for ubuntu for the latest versions of pytorch3d, right? If yes, is there a way to build the wheel on the remote machine...
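If building the wheel on the remote machine is acceptable, one hedged option is to point pip at the pytorch3d sources instead of a pre-built wheel, e.g. via a git requirement, so the agent compiles it at install time (the tag below is only an example):
```
# requirements sketch - pick the pytorch3d release you actually need
git+https://github.com/facebookresearch/pytorch3d.git@v0.5.0
```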
yea I just realized that you would also need to specify different subnets, etc… not sure how easy it is 🙂 But it would be very valuable, on-demand GPU instances are so hard to spin up nowadays in aws 🙂
Yea, so I assume that training my models using docker will be slightly slower, so I'd like to avoid it. For the rest, using docker is convenient
How about the overhead of running the training on docker on a VM?
AgitatedDove14 I now tested with a real experiment, it works, but I saw two issues:
1. It first doesn't detect torch, downloads it, but then says that it is already installed so it doesn't install it.
2. One of the dependencies of my repository is another repository (repo-2 in the logs). Both my repositories require numpy. When installing the first repository, it says `Requirement already satisfied: numpy in /home/workeruser/.local/lib/python3.6/site-packages`. Correct. But then it says `...
I am doing:
```
try:
    score = get_score_for_task(subtask)
except:
    score = pd.NA
finally:
    df_scores = df_scores.append(
        dict(task=subtask.id, score=score), ignore_index=True
    )
    task.upload_artifact("metric_summary", df_scores)
```
ha wait, I removed the http:// in the host and it worked 🙂
Interesting idea! (I assume for reporting only, not configuration)
Yes, for reporting only - also to understand which version is used by the agent to decide which torch wheel to download
regarding the cuda check with `nvcc`, I'm not saying this is a perfect solution, I just mentioned that this is how this is currently done.
I'm actually not sure if there is an easy way to get it from the nvidia-smi interface, worth checking though ...
Ok, but when `nvcc` is not ava...
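Not the agent's actual implementation, just a sketch of how the CUDA version could be scraped from the nvidia-smi banner when nvcc is missing (assumes a driver recent enough to print a "CUDA Version:" field):
```
import re
import subprocess

def cuda_version_from_nvidia_smi():
    # recent drivers print a header line like "... CUDA Version: 11.4 ..."
    output = subprocess.check_output(["nvidia-smi"], universal_newlines=True)
    match = re.search(r"CUDA Version:\s*([\d.]+)", output)
    return match.group(1) if match else None
```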
There it is: https://github.com/allegroai/clearml/issues/493
AppetizingMouse58 After some thought, we decided to install 0.16 from scratch, with no data migration, because we believe this was an edge case not worth spending effort on. Thank you very much for your help there, very appreciated. You guys rock! 🙂
Actually it was not related to clearml; the higher-level error causing this one was (somewhere in the stack trace): `RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd`
-> wrong numpy version
that would work for pytorch and clearml yes, but what about my local package?
SuccessfulKoala55 I was able to recreate the indices in the new ES cluster. I specified `number_of_shards: 4` for the `events-log-d1bd92a3b039400cbafc60a7a5b1e52b` index. I then copied the documents from the old ES using the `_reindex` API. The index is 7.5Gb on one shard.
Now I see that this index on the new ES cluster is ~19.4Gb 🤔 The index is divided into the 4 shards, but each shard is between 4.7Gb and 5Gb!
I was expecting to have the same index size as in the previous e...
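Roughly what that looked like, as a sketch against the ES REST API (host and target index name are placeholders; reindexing from a different cluster would additionally need a source.remote block, omitted here):
```
import requests

ES = "http://localhost:9200"  # placeholder host for the new cluster
SRC = "events-log-d1bd92a3b039400cbafc60a7a5b1e52b"
DST = SRC + "_new"  # placeholder target index name

# create the target index with 4 primary shards
requests.put(f"{ES}/{DST}", json={"settings": {"number_of_shards": 4}}).raise_for_status()

# copy the documents over with the _reindex API
requests.post(
    f"{ES}/_reindex",
    json={"source": {"index": SRC}, "dest": {"index": DST}},
).raise_for_status()
```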
sorry, the clearml-session. The error is the one I shared at the beginning of this thread
Sure! Here are the relevant parts:
```
Current configuration (clearml_agent v1.2.3, location: /tmp/.clearml_agent.3m6hdm1_.cfg):
...
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version = ==20.2.3
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 ...
```
Thanks for the help SuccessfulKoala55 , the problem was solved by updating the docker-compose file to the latest version in the repo: https://github.com/allegroai/clearml-server/blob/master/docker/docker-compose.yml
Make sure to do `docker-compose down` & `docker-compose up -d` afterwards, and not `docker-compose restart`
Hi NonchalantHedgehong19, thanks for the hint! What should be the content of the requirements file then? Can I specify my local package inside? How?
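For illustration only (the repo URL and package name below are made up): if the local package is pushed to a git repository the agent can reach, a line like this in the requirements file lets pip install it remotely; a plain local path would only work if that exact path also exists on the agent machine.
```
# requirements file sketch - hypothetical package/repo names
numpy
git+https://github.com/your-org/my_local_package.git@main#egg=my_local_package
```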
So if all artifacts are logged in the pipeline controller task, I need the last task to access all the artifacts from the pipeline task. I need to execute something like `PipelineController.get_artifact()` in the last step task
ok, what is your problem then?
what about the stacktrace of the error: `Error: Can not start new instance, An error occurred (InvalidParameterValue) when calling the RunInstances operation: Invalid availability zone: [eu-west-2]` ?
Could you please share the stacktrace?
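One observation on that error (not from the log itself): eu-west-2 is a region name, whereas RunInstances expects an availability zone such as eu-west-2a. A quick boto3 check of the valid zones, assuming AWS credentials are already configured:
```
import boto3

# list the availability zones that are valid for the eu-west-2 region
ec2 = boto3.client("ec2", region_name="eu-west-2")
zones = [
    z["ZoneName"]
    for z in ec2.describe_availability_zones()["AvailabilityZones"]
    if z["State"] == "available"
]
print(zones)  # expected to look like ['eu-west-2a', 'eu-west-2b', 'eu-west-2c']
```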
This https://discuss.elastic.co/t/index-size-explodes-after-split/150692 seems to say that with the _split API such a situation happens and resolves itself after a couple of days - maybe it's the same case for me?
Thanks! I would like to use this opportunity to split the indices into multiple shards, as explained here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-split-index.html#indices-split-index
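In case it helps, a rough sketch of that split via the REST API with Python requests (host and target index name are placeholders); note the source index has to be write-blocked first:
```
import requests

ES = "http://localhost:9200"  # placeholder host
SRC = "events-log-d1bd92a3b039400cbafc60a7a5b1e52b"
DST = SRC + "_split"  # placeholder target index name

# block writes on the source index, a prerequisite of _split
requests.put(
    f"{ES}/{SRC}/_settings",
    json={"settings": {"index.blocks.write": True}},
).raise_for_status()

# split into 4 primary shards (must be a multiple of the source shard count)
requests.post(
    f"{ES}/{SRC}/_split/{DST}",
    json={"settings": {"index.number_of_shards": 4}},
).raise_for_status()
```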
Ok, I got the following error when uploading the table as an artifact: `ValueError('Task object can only be updated if created or in_progress')`
So the problem comes when I do `my_task.output_uri = "s3://my-bucket"` - trains then checks in the background whether it has access to this bucket, and it is not able to find/read the creds
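A hedged way to verify that same access outside of trains, using boto3 with whatever credentials the SDK would pick up (bucket name is the placeholder from above):
```
import boto3
from botocore.exceptions import ClientError

def can_access_bucket(bucket="my-bucket"):
    # HeadBucket only succeeds if the current credentials can reach the bucket
    try:
        boto3.client("s3").head_bucket(Bucket=bucket)
        return True
    except ClientError as err:
        print("No access:", err)
        return False
```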