Do I set the CLEARML_FILES_HOST to the endpoint instead of an s3 bucket?
Yes, you are right, this is not straightforward:
CLEARML_FILES_HOST="s3://minio_ip:9001"
Notice you must specify the port, this is how it knows this is not AWS. I would avoid using an IP and instead register the MinIO server as a host on your local DNS / firewall. This way, if you change the IP, the links will not get broken 🙂
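For reference, a minimal sketch of the matching clearml.conf credentials entry for a MinIO endpoint (the hostname and credential values below are placeholders, not from this thread):
`
sdk {
    aws {
        s3 {
            credentials: [
                {
                    host: "minio.example.local:9001"  # host:port marks a non-AWS endpoint
                    key: "minio_access_key"           # placeholder
                    secret: "minio_secret_key"        # placeholder
                    multipart: false
                    secure: false                     # set to true if MinIO is served over TLS
                }
            ]
        }
    }
}
`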
Many thanks! I'll pass on to technical writers 🙂
Hi HelplessCrocodile8
yes there is:
in the first case, the new_key will be automatically logged:
`
a_dict = {}
a_dict = task.connect(a_dict)
a_dict['new_key'] = 42
`
In the second example, changes to the "object" passed to connect are not tracked.
make sense ?
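The second example isn't quoted in the thread, so purely as an illustration (the Config class and its fields are hypothetical): connecting an object logs its current attributes, and, per the answer above, changes made after connect() are not tracked automatically:
`
from clearml import Task

task = Task.init(project_name="examples", task_name="connect demo")

class Config:
    def __init__(self):
        self.batch_size = 32
        self.lr = 0.001

cfg = Config()
task.connect(cfg)    # current attribute values are logged as hyperparameters
cfg.new_field = 42   # set after connect() -- not tracked automatically
`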
Hi RipeGoose2
Any logs on the console ?
Could you test with a dummy example on the demoserver ?
Hi @<1533620191232004096:profile|NuttyLobster9>
... but no system stats.
If the job is too short (I think under 30 seconds), it doesn't have enough time to collect stats (basically it collects them over a 30-second window, but the task ends before it sends them)
does that make sense ?
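If you want to confirm that's the cause, one quick sanity check (just an illustration, not a recommended fix) is to keep the process alive past the reporting window and see whether the stats appear:
`
import time
from clearml import Task

task = Task.init(project_name="examples", task_name="stats check")
# ... the short job goes here ...
time.sleep(40)  # keep the process alive past the ~30s stats window
`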
Where again does clearml place the venv?
Usually ~/.clearml/venvs-builds/<python version>/
Multiple agents on the same machine will use venvs-builds.1, venvs-builds.2, and so on
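If you need to relocate it, the base folder is set in the agent section of clearml.conf (default shown below):
`
agent {
    # where clearml-agent builds the per-task virtual environments
    venvs_dir: ~/.clearml/venvs-builds
}
`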
yes, so it does exit the local process (at least, the command returns),
What do you mean the command returns ? are you running the script from bash and it returns to bash ?
ShallowGoldfish8 the models are uploaded in the background, task.close() is actually waiting for them, but wait_for_upload is also a good solution.
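If you want to block on the uploads explicitly before closing, a minimal sketch (assuming Task.flush is the wait_for_upload mechanism referred to above):
`
from clearml import Task

task = Task.init(project_name="examples", task_name="upload demo")
# ... train and store models ...
task.flush(wait_for_uploads=True)  # block until background uploads complete
task.close()
`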
where it seems to be waiting for the metrics, etc. but never finishes. No retry message is shown either.
From the description it sounds like there is a problem with sending the metrics?! task.close() waits for all the metrics to be sent, and it seems like for some reason they are not, which is why close hangs on them
A...
I think this was the issue: None
And that caused the TF binding to skip logging the scalars, and from that point on it broke the iteration numbering and so on.
I can install pytorch just fine locally on the agent, when I do not use clearml(-agent)
My thinking is the issue might be in the env file we are passing to conda; I can't find any other diff.
BTW:
@<1523701868901961728:profile|ReassuredTiger98> Can I send you a specific wheel with more debug prints to check (basically it will print the conda env YAML it is using)?
Could it be you have an old OS environment variable overriding the configuration file ?
Can you change the IP of the server in the conf file, and make sure it has an effect (i.e. the error changed)?
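A quick way to look for such overrides (assuming a Linux shell):
`
env | grep -iE "clearml|trains"   # any CLEARML_* (or legacy TRAINS_*) variable here can override the conf file
`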
Hmm what do you have here?
os.system("cat /var/log/studio/kernel_gateway.log")
Hmm, let me check something again.
Hi DepressedChimpanzee34
I think the main issue here is slow response time from the API server. I "think" you can increase the number of API server processes, but considering the 16GB, I'm not sure you have the headroom.
At peak usage, how much free RAM do you have on the machine ?
What do you mean by cache files ? Cache is machine specific and is set in the clearml.conf file.
Artifacts / models are uploaded to the files server (or any other object storage solution)
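For reference, a sketch of where the local cache location lives in clearml.conf (default path shown; adjust per machine):
`
sdk {
    storage {
        cache {
            # local download cache for artifacts / datasets on this machine
            default_base_dir: "~/.clearml/cache"
        }
    }
}
`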
it was uploading fine for most of the day
What do you mean by uploading fine most of the day ? Are you suggesting the upload got stuck on the way to GS ? Are you seeing the other metrics (scalars, console logs, etc.) ?
RoughTiger69 I think this could work, a pseudo example:
`
from time import sleep
from clearml import Task
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(...)
def the_last_step_before_external_stuff():
    print("doing some stuff")

@PipelineDecorator.pipeline(...)
def logic():
    the_last_step_before_external_stuff()
    # check_if_data_was_ingested_to_the_system is a placeholder for your own check
    if not check_if_data_was_ingested_to_the_system:
        print("aborting ourselves")
        Task.current_task().abort()
        # we will not get here, the agent will make sure we are stopped
        sleep(60)
        # better safe than sorry
        exit(0)
`
wdyt? (the...
Hi JitteryCoyote63 you can, but obviously you should be careful, they might both try to allocate more GPU memory than the HW actually has.
TRAINS_WORKER_NAME=machine_gpu0A trains-agent daemon --gpus 0 --queue default --detached
TRAINS_WORKER_NAME=machine_gpu0B trains-agent daemon --gpus 0 --queue default --detached
LOL EnormousWorm79 you should have a "do not show again" option, no?
And maybe adding idle time spent without a job to the API is not that bad an idea 😉
yes, adding that to the feature list 🙂
What if I write the last active state in an instance tag? This could be a solution…
I love this hack, yes this should just work.
BTW: if your lambda is a for loop that is constantly checking, there is no need to actually store the "last idle timestamp check" as a tag, no?
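For what it's worth, a rough sketch of that tag hack with boto3 (the region, instance id and tag name below are made up for illustration):
`
import time
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # hypothetical region
ec2.create_tags(
    Resources=["i-0123456789abcdef0"],                # hypothetical instance id
    Tags=[{"Key": "last_idle_timestamp", "Value": str(int(time.time()))}],
)
`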
This really makes little sense to me...
Can you send the full clearml-session --verbose console output ?
Something is not working as it should obviously, console output will be a good starting point
Can you do the following
Clone the Task you previously sent me the installed packages of, then enqueue the cloned task to the queue served by the conda agent.
Then send me the full log of the task that the agent ran
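If it's easier to do that programmatically, a minimal sketch (the task id and queue name are placeholders):
`
from clearml import Task

cloned = Task.clone(source_task="<task_id>", name="cloned for conda agent")  # placeholder id
Task.enqueue(cloned, queue_name="default")                                   # placeholder queue name
`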