LittleShrimp86 did you try to run the pipeline from the UI on remote machines (i.e. with the agents)? Did that work?
For reporting the console logs you can use: logger.report_text("my log line here", print_console=False)
https://github.com/allegroai/clearml/blob/b4942321340563724bc16f60ea5dd78c9161778d/clearml/logger.py#L120
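For example, a minimal sketch (project/task names here are just placeholders):

from clearml import Task

# placeholders: project / task names are assumptions
task = Task.init(project_name="examples", task_name="console logging demo")
logger = task.get_logger()
# report the line to the task's console log without also printing it to stdout
logger.report_text("my log line here", print_console=False)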
Notice you should be able to override them in the UI (under the Args section)
Wow, thank you very much. And how would I bind my code to task?
you mean the code that creates pipeline Tasks ?
(remember the pipeline itself is a Task in the system, basically if your pipeline code is a single script it will pack the entire thing )
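If it helps, a minimal controller-style sketch (project, step and queue names are placeholders), just to show that the controller itself registers as a Task:

from clearml.automation import PipelineController

# placeholders: project / step names and queue are assumptions
pipe = PipelineController(name="my pipeline", project="examples", version="0.0.1")
pipe.add_step(name="step_one", base_task_project="examples", base_task_name="step one task")
pipe.start(queue="services")  # the controller itself runs as a Task on this queue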
Try removing this magic environment variable that tells the sub-process there was already an initialized Task.
import os
env = dict(**os.environ)
env.pop('TRAINS_PROC_MASTER_ID', None)
🙂
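For example, a rough sketch of launching the sub-process with the cleaned environment (the script name is just a placeholder):

import os
import subprocess

env = dict(**os.environ)
env.pop('TRAINS_PROC_MASTER_ID', None)  # so the child process starts its own Task
subprocess.Popen(['python', 'my_sub_script.py'], env=env)  # 'my_sub_script.py' is hypothetical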
print(requests.get(url='
Hi WittyOwl57
Are you starting a new server from scratch or is it running on previously stored data?
Hi SubstantialElk6
Generally speaking here, the idea is that actual code creates a Dataset (i.e. a Dataset class created from code), plus you can add some metric reporting (like table reporting) to create a preview of the stored data for better visibility, or maybe create some statistics as part of the data ingest script. Then this ingest code can be relaunched / automated. The created Dataset itself can be tagged, renamed, or have key/value pairs added for better cataloging. wdyt?
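Roughly something like this, just as a sketch (dataset/project names, tags, path and the stats table are placeholders):

from clearml import Dataset
import pandas as pd

# placeholders: project / dataset names, tags and local path are assumptions
ds = Dataset.create(dataset_name="my_dataset", dataset_project="data", dataset_tags=["ingest-v1"])
ds.add_files(path="./raw_data")

# optional: attach a small preview / statistics table for better visibility in the UI
stats = pd.DataFrame({"files": [123], "total_mb": [456]})
ds.get_logger().report_table(title="ingest stats", series="summary", iteration=0, table_plot=stats)

ds.upload()
ds.finalize()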
'-v', '/tmp/clearml_agent.ssh.cbvchse1:/.ssh',
My bad; after that, inside the container it does cp -Rf /.ssh ~/.ssh
The reason is that you cannot know the user's home folder before spinning up the container
Anyhow the point is, are you sure that you have ~/.ssh on the Host machine configured?
And if you do, are you saying this is part of your AMI? If not, how did you put it there?
See the log:
Collecting keras-contrib==2.0.8
File was already downloaded c:\users\mateus.ca\.clearml\pip-download-cache\cu0\keras_contrib-2.0.8-py3-none-any.whl
so it did download it, but it failed to pass it correctly?!
Can you try with clearml-agent==1.5.3rc2?
SmarmySeaurchin8 check the logs, maybe you can find something there
RobustGoldfish9 do you see the trains-agent listed as a machine in the UI (under Workers)?
JitteryCoyote63 you mean from code?
Can you put the task.connect line here? (btw: I would assume there is no need for an additional connect if using hydra+fire, no?)
PompousHawk82 what do you mean by:
but the thing is that i can only use master to log everything
We could use our 8xA100 as 8 workers, for 8 single-gpu jobs running faster than on a single 1xV100 each.
@<1546665634195050496:profile|SolidGoose91> I think that in order to have the flexibility there you need the "dynamic" GPU allocation that is only part of the "enterprise" offering 😞
That said, why not allocate a single A100 machine?
that really depends on how much data you have there, and on the setup. The upside of the files-server is that you do not need to worry about credentials; the downside is that storage is more expensive
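For instance, the destination is just the output_uri at Task.init (the bucket path below is a placeholder, and S3 would need credentials in clearml.conf); omit output_uri to fall back to the default files-server:

from clearml import Task

# 's3://my-bucket/artifacts' is a placeholder destination
task = Task.init(project_name="examples", task_name="storage demo",
                 output_uri="s3://my-bucket/artifacts")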
any chance StorageManager could re-download files only if their size is different from file in cache (as an option)?
I think there is a force argument, to force the download.
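i.e. something along these lines, if I remember the argument name correctly (the URL is a placeholder):

from clearml import StorageManager

# force_download=True should bypass the local cache and re-download the file
local_path = StorageManager.get_local_copy(
    remote_url="s3://my-bucket/data/file.zip", force_download=True)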
I think the main issue is getting the size from different backends (i.e. s3 /https / etc.)
Maybe we should add it as a GitHub feature request issue?
The main limitation is that the driver "list()" does not return file size.
For example it might be an issue with the default http files-server.
wdyt?
is it a shared network mount? could you just delete the entire ~/.clearml on the host machine?
Scheduled training is what I’m looking forward to
I'll try to remember to update here once we push into the GitHub repo, feedback is always appreciated 🙂
If in the next two weeks you hear nothing, please ping here to make sure I did not forget 😉
VirtuousFish83 I remember an issue on GitHub with something similar, what's the clearml-server version you are using?
@<1523701066867150848:profile|JitteryCoyote63>
I just created a new venv and ran
pip install "torch==1.11.0.*" --extra-index-url
Then started python:
import torch
torch.cuda.is_available()
And I get True
what are you getting?
DilapidatedDucks58 I'm assuming clearml-server 1.7 ?
I think both are fixed in 1.8 (due to be released either next week, or the one after)
StickyBlackbird93 the agent is supposed to solve for the correct version of pytorch based on the CUDA in the container. Sounds like for some reason it fails? Can you provide the log of the Task that failed? Are you running the agent in docker-mode, or inside a docker?
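For reference, docker-mode just means the agent daemon was spun up with the --docker flag, roughly like this (the queue name and base image are only examples):

clearml-agent daemon --queue default --docker nvidia/cuda:11.7.1-runtime-ubuntu22.04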
Should work out of the box, as long as the task was started. You can forcefully start the task with: task.mark_started()
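e.g. a quick sketch (the task ID is a placeholder):

from clearml import Task

task = Task.get_task(task_id="<task_id>")  # placeholder ID
task.mark_started()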
Hi @<1573119962950668288:profile|ObliviousSealion5>
Hello, I don't really like the idea of providing my own github credentials to the ClearML agent. We have a local ClearML deployment.
if you own the agent, that should not be an issue, no?
forward my SSH credentials using ssh -A and then starting the clearml agent?
When you are running the agent and you force git cloning with SSH, it will automatically map ~/.ssh into the container for git to use
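If I recall correctly, the relevant setting in the agent's clearml.conf is roughly this (please double-check against your own config file):

agent {
    # force the agent to convert git http(s) links to ssh, so ~/.ssh is mapped into the container
    force_git_ssh_protocol: true
}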
Ba...