Since it fails on the first machine (clearml-server), I try to run it on another, on-prem machine (also used as an agent)
After I started clearml-session
I execute the clearml-agent this way: /home/machine/miniconda3/envs/py36/bin/python3 /home/machine/miniconda3/envs/py36/bin/clearml-agent daemon --services-mode --cpu-only --queue services --create-queue --log-level DEBUG --detached
might be worth documenting 😄
Hi AgitatedDove14, thanks for the answer! I will try adding multiprocessing_context='forkserver' to the DataLoader. In the issue you linked, nirraviv mentioned that forkserver was slower and shared a link to another issue https://github.com/pytorch/pytorch/issues/15849#issuecomment-573921048 where someone implemented a fast variant of the DataLoader to overcome the speed problem.
Did you experience any drop in performance using forkserver? If yes, did you test the variant suggested i...
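To be explicit, this is the change I have in mind (a minimal sketch with a dummy dataset and illustrative parameters, not my actual training code):
```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# dummy dataset just to make the snippet self-contained
dataset = TensorDataset(torch.randn(128, 3), torch.randint(0, 2, (128,)))

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    # use forkserver instead of the default fork start method on Linux
    multiprocessing_context="forkserver",
)
```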
Hi SuccessfulKoala55, it's not really wrong, rather I don't understand it: the docker image with the args after it
here is the function used to create the task:
` def schedule_task(parent_task: Task,
task_type: str = None,
entry_point: str = None,
force_requirements: List[str] = None,
queue_name="default",
working_dir: str = ".",
extra_params=None,
wait_for_status: bool = False,
raise_on_status: Iterable[Task.TaskStatusEnum] = (Task.TaskStatusEnum.failed, Task.Ta...
In the execution tab I see the old commit; in the logs I see an empty branch and the old commit
AgitatedDove14 Unfortunately no, I already had the problem before using the function; I added it hoping it would fix the issue but it didn't
The jump in the loss when resuming at iteration 31 is probably another issue -> for now I can conclude that:
I need to set sdk.development.report_use_subprocess = false
I need to call task.set_initial_iteration(0)
I also tried task.set_initial_iteration(-task.data.last_iteration), hoping it would counteract the bug, but it didn't work
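To sum up what I ended up with, roughly (a sketch only; I'm assuming the run is resumed with continue_last_task=True, and that sdk.development.report_use_subprocess: false is set in clearml.conf — project/task names are placeholders):
```python
from clearml import Task

# resume the previous run instead of creating a new task
task = Task.init(
    project_name="my_project",
    task_name="my_experiment",
    continue_last_task=True,
)

# reset the iteration offset so reported scalars don't jump
task.set_initial_iteration(0)
```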
well I still see some ES errors in the logs
` clearml-apiserver | [2021-07-07 14:02:17,009] [9] [ERROR] [clearml.service_repo] Returned 500 for events.add_batch in 65750ms, msg=General data error: err=('500 document(s) failed to index.', [{'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': '_doc', '_id': 'c2068648d2fe5da975665985f44c20b6', 'status':..., extra_info=[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not...
but not as much as the ELB reports
Any chance this is reproducible?
Unfortunately not at the moment, I couldn't find a reproducible scenario. If I clone a task that was stuck and start it, it might not be stuck
How many processes do you see running (i.e. ps -Af | grep python)?
I will check that when the next one is blocked 👍
What is the training framework? Is it multiprocess? How are you launching the process itself? Is it Linux OS? Is it running inside a specific container?
I train with p...
SuccessfulKoala55 Thanks to that I was able to identify the most expensive experiments. How can I count the number of documents for a specific series? I.e. I suspect that the loss, which is logged every iteration, is responsible for most of the documents logged, and I want to make sure of that
Here I have to do it for each task, is there a way to do it for all tasks at once?
But I would need to reindex everything, right? Is that an expensive operation?
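To be concrete, this is the kind of count I'm after (just a guess at the query: I'm assuming the scalar events live in the events-training_stats_scalar-* indices with task/metric fields, and that ES listens on localhost:9200):
```python
import requests

# count the scalar events reported for one task and one metric (e.g. the loss)
query = {
    "query": {
        "bool": {
            "must": [
                {"term": {"task": "<task_id>"}},
                {"term": {"metric": "loss"}},
            ]
        }
    }
}

resp = requests.get(
    "http://localhost:9200/events-training_stats_scalar-*/_count",
    json=query,
)
print(resp.json()["count"])
```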
I should also rename the /opt/trains/data/elastic_migrated_2020-08-11_15-27-05 folder to /opt/trains/data/elastic before running the migration tool, right?
The part where I'm lost is why you would need the path to the temp venv the agent creates/uses?
Let's say my task calls a bash script, and that bash script calls another Python program; I want that last Python program to be executed with the environment that was created by the agent for this specific task
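Something along these lines is what I mean (a sketch only; run_other.sh and TASK_PYTHON are made-up names):
```python
import os
import subprocess
import sys

# the task script itself already runs inside the venv the agent created,
# so sys.executable is the path to that venv's python interpreter
env = os.environ.copy()
env["TASK_PYTHON"] = sys.executable

# the bash script can then invoke "$TASK_PYTHON" other_program.py
subprocess.check_call(["bash", "run_other.sh"], env=env)
```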
AgitatedDove14 I eventually found a different way of achieving what I needed
But that was too complicated, I found an easier approach
CostlyOstrich36 good enough, I will fall back to sorting by updated, thanks!
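Something like this, I suppose (assuming task_filter's order_by is passed through to the server and that last_update is the right field name; both are my guesses):
```python
from clearml import Task

# most recently updated tasks first
tasks = Task.get_tasks(
    project_name="my_project",
    task_filter={"order_by": ["-last_update"]},
)
print([t.name for t in tasks[:5]])
```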
We would be super happy to have the possibility of documenting experiments (new tab in experiments UI) with a markdown editor!
Mmmh unfortunately not easily… I will try to debug deeper today. Is there a way to resume a task from code to debug locally?
Something like replacing Task.init with Task.get_task, so that Task.current_task is the same task as the output of Task.get_task
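Roughly this (just a sketch of what I'd like, not something I know works):
```python
from clearml import Task

# instead of creating a new task:
# task = Task.init(project_name="my_project", task_name="debug_run")

# attach to the existing (stuck) task for local debugging
task = Task.get_task(task_id="<task_id_of_the_stuck_run>")

# ...run the training code locally...
# and ideally the attached task would also be the current one:
print(Task.current_task() is task)  # I would like this to be True
```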
Is there any logic on the server side that could change the iteration number?