So with pipeline decorators can I implement this logic?
AgitatedDove14 Yes, you understood correctly. But Task.create is used by Task.init, something like this, right?
` def init(project_name, task_name):
    if not Task.exists_already(project_name, task_name):
        task = Task.create(...)
    else:
        task = load_existing_task()
    return task `
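To make the question concrete, here is a minimal, self-contained sketch of the "load if it exists, otherwise create" logic I have in mind. It does not claim to be how Task.init works internally: FakeTask and FakeTaskStore are hypothetical in-memory stand-ins for the ClearML backend, used only so the snippet runs on its own.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class FakeTask:
    """Hypothetical stand-in for a task record (not clearml.Task)."""
    project_name: str
    task_name: str


class FakeTaskStore:
    """Hypothetical in-memory stand-in for the backend that stores tasks."""

    def __init__(self) -> None:
        self._tasks: Dict[Tuple[str, str], FakeTask] = {}

    def get(self, project_name: str, task_name: str) -> Optional[FakeTask]:
        # Returns None when no task with this project/name pair exists.
        return self._tasks.get((project_name, task_name))

    def create(self, project_name: str, task_name: str) -> FakeTask:
        task = FakeTask(project_name, task_name)
        self._tasks[(project_name, task_name)] = task
        return task


def init(store: FakeTaskStore, project_name: str, task_name: str) -> FakeTask:
    """Return the existing task if there is one, otherwise create it."""
    existing = store.get(project_name, task_name)
    if existing is not None:
        return existing
    return store.create(project_name, task_name)
```

Calling init twice with the same project and task name returns the same object, which is the behavior I am asking about.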
I am going to try it again and send you the relevant part of the logs in a minute. Maybe I am interpreting something wrong.
I guess it started with the usage of the cleanup_service.
Maybe related question: Will there be some documentation about clearml internals with the new documentation? ClearML seems to store stuff that's relevant to script execution outside of clearml.Task if I am not mistaken. I would like to learn a little bit about what the code structure / internal mechanism is.
Could you guide me to the documentation for using the docker file? I am not able to find it. I only found task.set_base_docker, which I am not sure what it does.
I restarted it after I got the errors, because as everyone knows, turning it off and on usually works 😄
The agent is run with pip. However, the docker image uses conda (most probably because NVIDIA uses conda to build PyTorch). My theory is that when the task is run the first time on an agent, Task.init will update the requirements. Then, when it is run a second time, the task will contain the requirements of the (conda) environment from the first run.
Here it is
Or alternatively I just saw that Task.create takes a requirements.txt as an argument. This would also be fine for me, however I am not sure whether I should use Task.create?
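For reference, a rough sketch of what I mean. The keyword names (project_name, task_name, script, requirements_file) are how I understand the Task.create signature and should be checked against the installed clearml version. build_create_kwargs and create_task_from_requirements are hypothetical helper names, and the clearml import is deferred so that the first helper runs without clearml installed.

```python
from pathlib import Path
from typing import Any, Dict


def build_create_kwargs(
    project_name: str,
    task_name: str,
    script: str,
    requirements_file: str,
) -> Dict[str, Any]:
    """Collect the arguments that would be passed on to Task.create."""
    return {
        "project_name": project_name,
        "task_name": task_name,
        "script": script,
        # Assumption: Task.create accepts a path to a requirements.txt here.
        "requirements_file": str(Path(requirements_file)),
    }


def create_task_from_requirements(**kwargs: Any):
    """Create the task; needs clearml installed and a configured server."""
    # Imported lazily so build_create_kwargs works without clearml.
    from clearml import Task

    return Task.create(**kwargs)
```

Usage would be create_task_from_requirements(**build_create_kwargs("my_project", "my_task", "train.py", "requirements.txt")), where the project, task, and script names are placeholders.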
Yes, but this seems pretty reasonable to assume imo.
Perfect, thanks! The only issue left is that it seems like .ssh is used even when I provide SSH_AUTH_SOCK. I created an issue here: https://github.com/allegroai/clearml-agent/issues/45
Thanks, that makes sense. Can you also explain what task_log_buffer_capacity does?
Unfortunately, not. Quick question: Is there caching happening somewhere besides .clearml? Does the boto3 driver create cache?
I have a related question: I read here that 4GB is an HTTP limitation and ClearML will not chunk single files. I take from that, that ClearML did not want to, or did not need to, implement its own solution so far. But what about models that are larger than 4GB?
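As a stopgap on my side, I could split large model files myself before uploading and join them again after downloading. A minimal sketch of that idea follows. This is plain Python, not a ClearML feature, and split_file and join_files are hypothetical helper names.

```python
from pathlib import Path
from typing import List


def split_file(path: Path, chunk_size: int, out_dir: Path) -> List[Path]:
    """Split `path` into numbered parts of at most `chunk_size` bytes each."""
    out_dir.mkdir(parents=True, exist_ok=True)
    parts: List[Path] = []
    with path.open("rb") as src:
        index = 0
        while True:
            # One chunk is held in memory at a time, so for multi-GB parts
            # this read should be replaced by a streamed copy.
            data = src.read(chunk_size)
            if not data:
                break
            part = out_dir / f"{path.name}.part{index:04d}"
            part.write_bytes(data)
            parts.append(part)
            index += 1
    return parts


def join_files(parts: List[Path], target: Path) -> None:
    """Concatenate the parts, in the given order, back into one file."""
    with target.open("wb") as dst:
        for part in parts:
            dst.write(part.read_bytes())
```

Each part could then be uploaded as a separate file that stays below the size limit, at the cost of having to track the part order myself.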
Then I could also do this:
` # My custom very special use case
task = Task()
task = task.load_statedict(await Task.load_or_create(task_name))
await task.synchronize()
await run_code_analysis()
task.add_requirement("myreq")
await task.synchronize() `
Yes, I do not want to rely on the clearml-agent. Afaik the clearml-sdk in the container does the downloading and since a host directory is mounted, it is mirrored there. If it was possible to not mount the host directory, everything would be contained 🙂
Thank you for clearing that up 🙂
I will read up on the services documentation then. Thank you very much for the help 🙂
One question: Does clearml resolve the CUDA Version from driver or conda?
Okay, I didn't know that. I just saw that VSCode seems to use a similar setup for their docker devcontainers.
` =============
== PyTorch ==
NVIDIA Release 22.03 (build 33569136)
PyTorch Version 1.12.0a0+2c916ef ...
Looking in indexes: ,
Requirement already satisfied: pip in /root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages (22.0.4)
2022-04-07 16:40:57
Looking in indexes: ,
Requirement already satisfied: Cython in /opt/conda/lib/python3.8/site-packages (0.29.28)
Looking in indexes: ,
Requirement already satisfied: numpy==1.22.3 in /opt/conda/... `
I see, I just checked the logs and it shows
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f246f0d6c18>: Failed to establish a new connection: [Errno 111] Connection refused
[2022-04-29 08:45:55,018] [9] [WARNING] [elasticsearch] POST [status:N/A request:0.000s]
Unfortunately, there are no logs in /usr/share/elasticsearch/logs to see what elastic was up to