yes, exactly: I run python my_script.py, the script executes, creates the task, calls task.execute_remotely(exit_process=True) and returns to bash. Then, in the bash console, after some time, I see some messages being logged from clearml
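Roughly what the script does (a minimal sketch; the project and queue names are illustrative, not my actual ones):

```python
from clearml import Task

task = Task.init(project_name="examples", task_name="my_script")
# ... set up the experiment ...
# enqueue the task for remote execution and exit the local process right here
task.execute_remotely(queue_name="default", exit_process=True)
```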
Hi TimelyPenguin76, I guess it tries to spin them down a second time, hence the double print
You are right, thanks! I was trying to move /opt/trains/data to an external disk, mounted at /data
And since I ran the task locally with python3.9, it used that version in the docker container
and in the logs:
```
agent.worker_name = worker1
agent.force_git_ssh_protocol = false
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version = ==20.2.3
agent.package_manager.system_site_packages = true
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = defaults
agent.package_manager.torch_nightly = false
agent.venvs_dir = /...
```
CostlyOstrich36, this also happens with clearml-agent 1.1.1 on an AWS instance…
AgitatedDove14 That's a good point: the experiment failing with this error does show the correct AWS key:
... sdk.aws.s3.key = ***** sdk.aws.s3.region = ...
BTW, is there any specific reason for not upgrading to clearml?
I just didn't have time so far 🙂
Thanks SuccessfulKoala55 !
Maybe you could add an option to your docker-compose file for limiting the size of the logs; since there is no limit by default, their size will grow forever, which doesn't sound ideal: https://docs.docker.com/compose/compose-file/#logging
There is an example in the https://github.com/allegroai/clearml/blob/master/docs/datasets.md#workflow section of the link I shared above
btw I monkey patched ignite's global_step_from_engine function to print the iteration and passed the modified function to ClearMLLogger.attach_output_handler(…, global_step_transform=patched_global_step_from_engine(engine)). It prints the correct iteration number when ClearMLLogger.OutputHandler.__call__ is called.
```python
def __call__(self, engine: Engine, logger: ClearMLLogger, event_name: Union[str, Events]) -> None:
    if not isinstance(logger, ClearMLLogger):
        ...
```
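Roughly what the patch looks like (a sketch; the helper name and print format are illustrative, and it assumes ignite's standard State.get_event_attrib_value API):

```python
from ignite.engine import Engine, Events


def patched_global_step_from_engine(engine: Engine):
    # same contract as ignite's global_step_from_engine, with an extra print
    def wrapper(_, event_name: Events):
        step = engine.state.get_event_attrib_value(event_name)
        print(f"global_step_transform called with step={step}")
        return step

    return wrapper
```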
AgitatedDove14 any chance you found something interesting? 🙂
ClearML has a task.set_initial_iteration, I used it as such:
```python
checkpoint = torch.load(checkpoint_fp, map_location="cuda:0")
Checkpoint.load_objects(to_load=self.to_save, checkpoint=checkpoint)
task.set_initial_iteration(engine.state.iteration)
```
But still the same issue, I am not sure whether I use it correctly and if it's a bug or not, AgitatedDove14? (I am using clearml 1.0.4rc1, clearml-agent 1.0.0)
Mmmh unfortunately not easily… I will try to debug deeper today, is there a way to resume a task from code to debug locally?
Something like replacing Task.init with Task.get_task so that Task.current_task is the same task as the output of Task.get_task
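Roughly what I mean (a sketch of the desired behaviour, not something that works today as far as I know):

```python
from clearml import Task

# instead of Task.init(...), attach to the already-created task for local debugging
task = Task.get_task(task_id="<existing-task-id>")

# desired behaviour: the current task would be that same task object
assert Task.current_task() is task
```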
Well no luck - using matplotlib.use('agg') in my training codebase doesn't solve the mem leak
but not as much as the ELB reports
Actually it was not related to clearml; the higher-level error causing this one was (somewhere in the stack trace): RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd -> wrong numpy version
We would be super happy to have the possibility of documenting experiments (new tab in experiments UI) with a markdown editor!
mmmh it fails, but if I connect to the instance and execute ulimit -n, I do see 65535, while the tasks I send to this agent fail with:
```
OSError: [Errno 24] Too many open files: '/root/.commons/images/aserfgh.png'
```
and from the task itself, I run:
```python
import subprocess
print(subprocess.check_output("ulimit -n", shell=True))
```
which gives me in the logs of the task: b'1024'. So nofiles is still 1024, the default value, but not when I ssh, damn. Maybe rebooting would work
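(For what it's worth, the same check without shelling out, using only the standard library:)

```python
import resource

# soft and hard limits on the number of open file descriptors for this process
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)
```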
Could you please point me to the relevant component? I am not familiar with typescript unfortunately 😞
it would be nice if Task.connect_configuration could support custom yaml file readers for me
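In the meantime, a possible workaround along these lines (a sketch; the file name is illustrative, and any custom YAML reader could replace yaml.safe_load):

```python
import yaml
from clearml import Task

task = Task.current_task()
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)  # a custom reader would go here
cfg = task.connect_configuration(cfg, name="config")
```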
Trying now your code… should take a couple of mins
It failed as well
AgitatedDove14 I cannot confirm at 100%, the context is different (see previous messages) but it could be the same bug behind the scene...
SuccessfulKoala55 Thanks to that I was able to identify the most expensive experiments. How can I count the number of documents for a specific series? I.e. I suspect that the loss, which is logged every iteration, is responsible for most of the documents logged, and I want to make sure of that
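Something like this is what I have in mind for counting them (just a sketch — the index pattern and field name are assumptions about how the scalar events are stored, not verified):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
resp = es.count(
    index="events-training_stats_scalar-*",       # assumed index pattern
    body={"query": {"term": {"metric": "loss"}}},  # assumed field name
)
print(resp["count"])
```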
The parent task is a data_processing task, therefore I retrieve it so that I can then do data_processed = parent_task.artifacts["data_processed"]
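Roughly like this (a sketch; the artifact name comes from my pipeline, and I may call .get_local_copy() instead of .get() depending on the artifact type):

```python
from clearml import Task

task = Task.current_task()
parent_task = Task.get_task(task_id=task.parent)  # the data_processing task
data_processed = parent_task.artifacts["data_processed"].get()
```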
Hi AgitatedDove14 , that’s super exciting news! 🤩 🚀
Regarding the two outstanding points:
In my case, I’d maintain a client python package that takes care of the pre/post processing of each request, so that I only send the raw data to the inference service and I post process the raw output of the model returned by the inference service. But I understand why it might be desirable for the users to have these steps happening on the server. What is challenging in this context? Defining how t...
Is it safe to turn off replication while a reindex operation is happening? The reindexing is rather slow and I am wondering if turning off replication will speed up the process
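For context, this is the kind of change I mean (a sketch with the elasticsearch Python client; the index pattern is illustrative):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
# temporarily drop replicas for the indices being reindexed
es.indices.put_settings(
    index="events-*",
    body={"index": {"number_of_replicas": 0}},
)
```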
From my experience, I only installed CUDA drivers on my machines. I didn't use conda to install torch or cudatoolkit; I just let clearml-agent download the torch wheel file and install it