
This is the issue, I will make sure wait_for_status() calls reload at the end, so when the function returns you have the updated object
That sounds awesome! It will definitely fix my problem 🙂
In the meantime, I now do:
task.wait_for_status()
task._artifacts_manager.flush()
task.artifacts["output"].get()
But I still get KeyError: 'output'
... Is that normal? Will it work if I replace the second line with task.refresh() ?
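Roughly the flow I have in mind, as a sketch (assuming the standard Task API: wait_for_status(), reload() and the artifacts dict; the task id is a placeholder):

```python
from clearml import Task  # "trains" in older versions

task = Task.get_task(task_id="...")        # the task whose artifact I need
task.wait_for_status()                     # block until it finishes running
task.reload()                              # refresh the local copy of the task object
artifact = task.artifacts["output"].get()  # should no longer raise KeyError
```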
I found that the filter actually has to be an iterable: Task.get_tasks(project_name="my-project", task_name="my-task", task_filter=dict(type=["training"]))
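For completeness, how I call it now, as a sketch (project/task names are placeholders):

```python
from clearml import Task

# task_filter values apparently need to be iterables, e.g. type=["training"]
tasks = Task.get_tasks(
    project_name="my-project",
    task_name="my-task",
    task_filter=dict(type=["training"]),
)
for t in tasks:
    print(t.id, t.status)
```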
no it doesn't!
3. They select any point that is an improvement over time
Hi AnxiousSeal95, I hope you had a nice holiday! Thanks for the update! I discovered h2o when looking for ways to deploy dashboards with apps like streamlit. Most likely I will use either streamlit deployed through clearml, or h2o as a standalone if ClearML won't support deploying apps (which is totally fine, no offense there 🙂)
For some reason, when cloning task A, trains sets an old commit in task B. I tried to recreate task A to force a new task id and new commit id, but I still get the same issue
I see 3 agents in the "Workers" tab
yes, in setup.py I have:
..., install_requires=[ "my-private-dep @ git+ ", ... ], ...
Sure, I opened an issue: https://github.com/allegroai/clearml/issues/288 . Unfortunately I don't have time to open a PR 😞
I see what I described in https://allegroai-trains.slack.com/archives/CTK20V944/p1598522409118300?thread_ts=1598521225.117200&cid=CTK20V944 :
randomly, one of the two experiments is shown for that agent
I am doing:
try:
    score = get_score_for_task(subtask)
except Exception:
    score = pd.NA
finally:
    df_scores = df_scores.append(dict(task=subtask.id, score=score), ignore_index=True)
    task.upload_artifact("metric_summary", df_scores)
It could be: I am running the clearml AWS autoscaler on an EC2 instance that has IAM roles allowing it to create/delete instances, but I get Warning! exception occurred: An error occurred (UnauthorizedOperation) when calling the RunInstances operation: You are not authorized to perform this operation. Encoded authorization failure message: ...
I suspect that since the agent is running in docker mode, the boto3 lib doesn't automatically get the right permissions from the EC2 instance. To...
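A quick way to check that, as a sketch (run inside the agent's container; plain boto3, nothing autoscaler-specific):

```python
import boto3

# If the instance-profile credentials are reachable from inside the container,
# this prints the role identity; otherwise it raises a credentials error or
# shows an unexpected identity.
print(boto3.client("sts").get_caller_identity())
```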
I had this problem before
Nevermind, I just saw report_matplotlib_figure
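For reference, roughly how it is used, as a sketch (project/task names are placeholders):

```python
import matplotlib.pyplot as plt
from clearml import Task

task = Task.init(project_name="examples", task_name="matplotlib demo")

fig = plt.figure()
plt.plot([1, 2, 3], [4, 5, 6])

# report the figure explicitly instead of relying on automatic capture
task.get_logger().report_matplotlib_figure(
    title="my plot", series="series A", iteration=0, figure=fig
)
```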
🙂
Ok so the problem was indeed the way docker was installed (with snap)
in the UI the value is the correct one (not empty, a string)
SuccessfulKoala55 I was able to recreate the indices in the new ES cluster. I specified number_of_shards: 4 for the events-log-d1bd92a3b039400cbafc60a7a5b1e52b index. I then copied the documents from the old ES using the _reindex API. The index is 7.5GB on one shard.
Now I see that this index on the new ES cluster is ~19.4GB 🤔 The index is divided into the 4 shards, but each shard is between 4.7GB and 5GB!
I was expecting to have the same index size as in the previous e...
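For reference, the kind of calls I used, as a sketch over plain HTTP with requests (the old-cluster host is a placeholder, and remote reindex also needs the old host whitelisted in elasticsearch.yml):

```python
import requests

ES = "http://internal-aws-host-name:9200"

# copy the documents from the old cluster into the new 4-shard index
requests.post(
    f"{ES}/_reindex",
    json={
        "source": {
            "index": "events-log-d1bd92a3b039400cbafc60a7a5b1e52b",
            "remote": {"host": "http://old-es-host:9200"},
        },
        "dest": {"index": "events-log-d1bd92a3b039400cbafc60a7a5b1e52b"},
    },
)

# then compare the per-shard sizes
print(requests.get(f"{ES}/_cat/shards/events-*?v&h=index,shard,store").text)
```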
So there will be no concurrent cached files access in the cache dir?
File "devops/valid.py", line 80, in valid(parse_args) File "devops/valid.py", line 41, in valid valid_task.output_uri = args.artifacts File "/data/.trains/venvs-builds/3.6/lib/python3.6/site-packages/trains/task.py", line 695, in output_uri ", check configuration file ~/trains.conf".format(value)) ValueError: Could not get access credentials for 's3://ml-artefacts' , check configuration file ~/trains.conf
So in my minimal reproducible example it does work 🤣 very frustrating, I will continue searching for that nasty bug
I reindexed only the logs to a new index afterwards; I am now doing the same with the metrics, since they cannot be displayed in the UI because of their wrong dynamic mappings
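The idea is to create the target index with explicit mappings first and only then _reindex into it, so ES does not guess the types dynamically again. A sketch (index and field names are placeholders):

```python
import requests

ES = "http://internal-aws-host-name:9200"

# define the mappings up front on the new index...
requests.put(
    f"{ES}/events-metrics-new",
    json={"mappings": {"properties": {
        "value": {"type": "float"},
        "iter": {"type": "long"},
        "timestamp": {"type": "date"},
    }}},
)
# ...then copy the documents into it with the _reindex API, as above
```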
The host is accessible: I can ping it and even run curl "http://internal-aws-host-name:9200/_cat/shards" and get results from the local machine
Is there one?
No, I rather wanted to understand how it worked behind the scenes 🙂
The latest RC (0.17.5rc6) moved all logs into a separate subprocess to improve speed with pytorch dataloaders
That's awesome!
Yes AgitatedDove14 🙂
That's why I suspected trains was installing a different version than the one I expected
That was also my feeling! But I thought that spawning the trains-agent from a conda env would isolate me from the cuda drivers on the system
Would adding an ILM (index lifecycle management) policy be an appropriate solution?
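What I have in mind, as a sketch (policy name and thresholds are just examples):

```python
import requests

ES = "http://internal-aws-host-name:9200"

# hypothetical policy: roll indices over at 20 GB, delete them after 90 days
requests.put(
    f"{ES}/_ilm/policy/events-rollover",
    json={"policy": {"phases": {
        "hot": {"actions": {"rollover": {"max_size": "20gb"}}},
        "delete": {"min_age": "90d", "actions": {"delete": {}}},
    }}},
)
```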
Answering myself: Yes, Task.set_base_docker
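i.e. something like this sketch (the image name is just an example; assuming the single-string form of set_base_docker):

```python
from clearml import Task

task = Task.init(project_name="examples", task_name="docker demo")
# tell the agent which docker image (plus extra docker arguments) to run this task in
task.set_base_docker("nvidia/cuda:10.1-runtime-ubuntu18.04 --env MY_ENV=1")
```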
RTFM!!!