basically:
` # register an artifact on this controller task, then clone and enqueue a child
from trains import Task

task = Task.init("test", "test", "controller")
task.upload_artifact("test-artifact", dict(foo="bar"))

# clone the current task and point the clone at the second script
cloned_task = Task.clone(task, name="test", parent=task.task_id)
cloned_task.data.script.entry_point = "test_task_b.py"
cloned_task._update_script(cloned_task.data.script)

# pass the artifact name to the clone, then send it to the default queue
cloned_task.set_parameters(**{"artifact_name": "test-artifact"})
Task.enqueue(cloned_task, queue_name="default") `
the latest version, but I think it's normal: I set TRAINS_WORKER_ID="trains-agent:$DYNAMIC_INSTANCE_ID", where DYNAMIC_INSTANCE_ID is the ID of the machine
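(Roughly what the launch looks like; a sketch in Python, where the instance-id lookup and the queue name are assumptions about the environment:)
` import os
import subprocess

# compose a stable, unique worker id per machine; DYNAMIC_INSTANCE_ID is
# assumed to be injected by the cloud environment
instance_id = os.environ.get("DYNAMIC_INSTANCE_ID", "unknown")
os.environ["TRAINS_WORKER_ID"] = f"trains-agent:{instance_id}"
subprocess.run(["trains-agent", "daemon", "--queue", "default"], check=True) `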
This is the mapping of the faulty index:
` {
  "events-plot-d1bd92a3b039400cbafc60a7a5b1e52b_new" : {
    "mappings" : {
      "dynamic" : "strict",
      "properties" : {
        "@timestamp" : {
          "type" : "date"
        },
        "iter" : {
          "type" : "long"
        },
        "metric" : {
          "type" : "keyword"
        },
        "plot_data" : {
          "type" : "binary"
        },
        "plot_len" : {
          "type" : "long"
        },
        "plot_str" : {
...
Both ^^, I already adapted the code for GCP, and I was planning to adapt it to Azure now
CostlyOstrich36 yes, when I scroll up, a new events.get_task_log request is fired and the response doesn’t contain any logs (but it should)
I just moved one experiment to another project; after moving it, I am taken to the new project, where the layout is then reset
To be fully transparent, I did a manual reindexing of the whole ES DB one year ago after it ran out of space; at that point I might have changed the mapping to strict, but I am not sure. Could you please confirm that the mapping is correct?
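If it helps, the dynamic setting can be read back from ES directly; a minimal sketch, assuming the server is reachable on localhost:9200:
` import json
import urllib.request

index = "events-plot-d1bd92a3b039400cbafc60a7a5b1e52b_new"
with urllib.request.urlopen(f"http://localhost:9200/{index}/_mapping") as resp:
    mapping = json.load(resp)

# "strict" means documents containing unmapped fields are rejected outright
print(mapping[index]["mappings"].get("dynamic")) `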
AgitatedDove14 So I’ll just replace task = clearml.Task.get_task(clearml.config.get_remote_task_id()) with Task.init() and wait for your fix 🙂
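(A minimal sketch of the replacement, assuming the usual agent behavior where Task.init() attaches to the already-created remote task; the project/task names are placeholders:)
` from clearml import Task

# when executed by an agent, Task.init() reuses the existing remote task
# instead of creating a new one, so no manual get_remote_task_id() lookup
task = Task.init(project_name="my-project", task_name="my-task") `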
Does what you suggested here > https://github.com/allegroai/trains-agent/issues/18#issuecomment-634551232 also apply to containers used by the services queue?
AgitatedDove14 I eventually found a different way of achieving what I needed
Here is the minimal reproducible example.
Run test_task_a.py: it will register a dummy artifact, create a new task, set a parameter in that task, and enqueue it. test_task_b will then try to retrieve the parameter from the parent task and fail
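For context, a minimal test_task_b.py along these lines (the retrieval calls and the parameter key are assumptions based on the snippet above):
` from trains import Task

task = Task.init("test", "test")
# the clone received "artifact_name" via set_parameters(); depending on the
# version, the key may come back flat or prefixed with a section name
params = task.get_parameters()
artifact_name = params.get("artifact_name", "test-artifact")

# the artifact lives on the parent task, not on the clone itself
parent = Task.get_task(task_id=task.data.parent)
print(parent.artifacts[artifact_name].get())  # expected: {"foo": "bar"} `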
I have a custom way of reading the config file
RobustRat47 It can also simply be that the instance type you declared is not available in the zone you defined
Ok so the problem was indeed the way docker was installed (with snap)
I’d like to move to a setup where I don’t need these tricks
SuccessfulKoala55 They do have the right file path, e.g. https://***.com:8081/my-project-name/experiment_name.b1fd9df5f4d7488f96d928e9a3ab7ad4/metrics/metric_name/predictions/sample_00000001.png
But that was too complicated, I found an easier approach
Ok AgitatedDove14 SuccessfulKoala55 I made some progress in my investigation:
I can pinpoint the exact change that introduced the bug: it is the one changing the endpoint "events.get_task_log" to min_version="2.9"
In the Firefox console > Network tab, I can edit an events.get_task_log request and change the URL from …/api/v2.9/events.get_task_log to …/api/v2.8/events.get_task_log (to use the endpoint "events.get_task_log" with min_version="1.7"), and then all the logs are ...
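For completeness, the same comparison can be scripted; a sketch, where the server URL, token and task id are placeholders and the exact request/response shape is an assumption:
` import json
import urllib.request

def get_task_log(api_version, task_id, token):
    # hit the same endpoint under two API versions and compare the payloads
    url = f"https://localhost:8008/api/v{api_version}/events.get_task_log"
    req = urllib.request.Request(
        url,
        data=json.dumps({"task": task_id}).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer " + token},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# old = get_task_log("2.8", "<task-id>", "<token>")
# new = get_task_log("2.9", "<task-id>", "<token>") `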
yes, in setup.py I have: ` ..., install_requires=["my-private-dep @ git+ ", ...], ... `
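(For reference, the full direct-reference syntax looks roughly like this; the repository URL below is a hypothetical placeholder, not the one elided above:)
` from setuptools import setup

setup(
    name="my-package",
    version="0.1.0",
    install_requires=[
        # hypothetical URL, for illustration only
        "my-private-dep @ git+https://github.com/example/my-private-dep.git@v1.0",
    ],
) `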
Hi AgitatedDove14 EnthusiasticShrimp49, the issue above seemed to be the memory leak, and it looks like there is no problem on the ClearML side.
I trained successfully without a memory leak with num_workers=0 and I am now testing with num_workers=8.
Sorry for the false positive :man-bowing:
ok, so there is no way to cache it and detect when the ref changes?
even if I move the GitHub workers internally where they could have access to the prod server, I am not sure I would like that, because it would pile up unnecessary test data on the prod server
I can ssh into the agent and:
` source /trains-agent-venv/bin/activate
(trains_agent_venv) pip show pyjwt
Version: 1.7.1 `
with the CLI, on a conda env located in /data
did you try with another availability zone?
