That is because my own machine has 10.2 (not the docker, the machine the agent is on)
No, that has nothing to do with it; the CUDA is inside the container. I'm referring to this image https://allegroai-trains.slack.com/archives/CTK20V944/p1593440299094400?thread_ts=1593437149.089400&cid=CTK20V944
Assuming this is the output from your code running inside the docker, it points to CUDA version 10.2.
Am I missing something?
Hi FierceFly22
You called execute_remotely a bit too soon. Any manual configuration calls have to be made before it, so that they are stored in the Task. This includes task.connect and task.connect_configuration.
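To make the ordering concrete, here is a minimal sketch, not the actual script from the thread. The helper name configure_then_enqueue, the parameter values, the "training_config" name, and the "default" queue are all made up for illustration; the only point is that task.connect and task.connect_configuration run before task.execute_remotely.

```python
from typing import Any, Dict


def configure_then_enqueue(
    task: Any, params: Dict[str, Any], config: Dict[str, Any], queue_name: str
) -> None:
    """Register all manual configuration on the Task, then hand it over to an agent."""
    # Both calls store their values on the Task, so the remote run can see them.
    task.connect(params)
    task.connect_configuration(config, name="training_config")  # hypothetical name
    # Only now leave the local run and enqueue the Task for remote execution.
    task.execute_remotely(queue_name=queue_name)


def main() -> None:
    # Imported lazily so the helper above can be tried without a ClearML server.
    from clearml import Task

    # Project, task and queue names are placeholders.
    task = Task.init(project_name="examples", task_name="remote run")
    configure_then_enqueue(task, {"lr": 0.001}, {"layers": [64, 64]}, queue_name="default")
```

Calling main() on a machine with clearml configured would register both configuration objects and then enqueue the Task.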
FYI: if you need to query stuff you can always look directly in the REST API:
https://github.com/allegroai/clearml/blob/master/clearml/backend_api/services/v2_9/projects.py
https://allegro.ai/clearml/docs/rst/references/clearml_api_ref/index.html
Hi ReassuredTiger98
I think DefiantCrab67 solved it 🙂
https://clearml.slack.com/archives/CTK20V944/p1617746462341100?thread_ts=1617703517.320700&cid=CTK20V944
PompousParrot44 these are the default plotly colors. You can change any of the layout properties with the
https://github.com/allegroai/trains/blob/65a4aa7aa90fc867993cf0d5e36c214e6c044270/trains/logger.py#L600
What do you have here in your docker compose:
None
CheerfulGorilla72 as I understand it, there were some delays with the current release, so it is going to be out this week. The one after that includes this feature and, as far as I understand, would be mid Dec.
The reason is because it is logged as an image, not a plot 🙂
CheerfulGorilla72 use the following bucket name when you are configuring your files/output uri
s3://<iphere>:<porthere>/<bucket_here>
From there everything should work as expected
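A small sketch of building that URI, in case it helps. The helper name storage_uri and the host, port and bucket values are invented for illustration; only the s3://host:port/bucket shape comes from the message above.

```python
def storage_uri(host: str, port: int, bucket: str, prefix: str = "") -> str:
    """Build an s3://<host>:<port>/<bucket> URI for a self-hosted S3-compatible store."""
    if not host or not bucket:
        raise ValueError("host and bucket must be non-empty")
    uri = f"s3://{host}:{port}/{bucket}"
    if prefix:
        # Optional sub-folder inside the bucket, normalised to have no stray slashes.
        uri += "/" + prefix.strip("/")
    return uri


# Placeholder address and bucket name.
example_uri = storage_uri("10.0.0.5", 9000, "models")
```

The resulting string is what you would use as the files/output uri, for example as the output_uri argument of Task.init.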
What's your clearml version (python and server) ?
It seems that once the job has completed once, it doesn't accept any new report...
completed can be forced, published cannot ...
What's the error you are getting ?
K8s + clearml-agent integration.
Hmm is this an on-prem k8s cluster?
Oh if this is the case you can probably do
` import os
import subprocess
from time import sleep
from clearml import Task
from clearml.backend_api.session.client import APIClient

client = APIClient()
queue_ids = client.queues.get_all(name="queue_name_here")
while True:
    # pull the next queued Task (if any) from the first matching queue
    result = client.queues.get_next_task(queue=queue_ids[0].id)
    if not result or not result.entry:
        # nothing queued, wait a bit and poll again
        sleep(5)
        continue
    task_id = result.entry.task
    # mark the Task as started before launching it
    client.tasks.started(task=task_id)
    env = dict(**os.environ)
    env['CLEARML_TASK_ID'] = ta...
Hi GiddyTurkey39
First, yes you can just edit the "installed packages" section and add any missing package (this is equal to requirements.txt)
I wonder why trains failed to detect the "bigquery" package in the first place... Any thoughts?
Hi NonsensicalSparrow35
however for the remote file it always creates the name with the following pattern:
{filename_prefix}checkpoint{n}.pt
..
Is this the main issue?
Notice that the model name (i.e. the entry on the Task itself) is not directly connected with the stored file name on the target file server (or S3)
TrickyRaccoon92 the title provided by write.scalars is also a string identifying the specific metric. This is more than just a title on the plot itself.
It means that this will be the name of the scalar metric (title/series combination).
Is that your intention, or is it for viewing purpose only?
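To illustrate the title/series combination, a minimal sketch using Logger.report_scalar. The helper name report_losses and the "loss" / "train" / "validation" names are made up for the example; two series reported under the same title end up as two curves of one metric.

```python
from typing import Any


def report_losses(logger: Any, iteration: int, train_loss: float, val_loss: float) -> None:
    """Report two series under one title; each title/series pair names one scalar metric."""
    logger.report_scalar(title="loss", series="train", value=train_loss, iteration=iteration)
    logger.report_scalar(title="loss", series="validation", value=val_loss, iteration=iteration)
```

With a real Task you would pass task.get_logger() as the logger argument. Changing the title therefore changes which metric the values belong to, not only the caption on the plot.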
DistressedGoat23 you are correct: since in the end this becomes a plotly object, the extra_layout is for general-purpose layout, but this specific entry sits next to the data. Bottom line, can you open a GitHub issue so we do not forget to fix it? In the meantime you can use the general plotly reporting as SweetBadger76 suggested
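A rough sketch of what the general plotly reporting could look like, assuming the property you need lives on the trace (next to the data) rather than in the layout. The helper names, the bar chart, the colors and the title/series strings are all hypothetical; Logger.report_plotly takes either a plotly Figure or a plain plotly dictionary, so a dict is used here to avoid extra imports.

```python
from typing import Any, Dict, List


def build_bar_figure(labels: List[str], values: List[float], colors: List[str]) -> Dict[str, Any]:
    """Build a plotly figure dict where the per-bar colors are set on the trace itself."""
    return {
        "data": [{"type": "bar", "x": labels, "y": values, "marker": {"color": colors}}],
        "layout": {"title": {"text": "Samples per class"}},
    }


def report_figure(logger: Any, figure: Dict[str, Any], iteration: int = 0) -> None:
    """Send the ready-made figure as is, instead of going through extra_layout."""
    logger.report_plotly(title="class balance", series="train", figure=figure, iteration=iteration)
```

Since you build the whole figure yourself, anything plotly supports on the trace or the layout can be set directly.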
Thanks for the details EnchantingPenguin77
clearml.Auto-Scaler - INFO - New instance b97e702d-e2b3-4f28-adab-be59648601ea listening to test-gpu queue
This looks like a new agent was spun up on your EC2 account, can you see it in the "Workers" page?
Hi DefeatedMole42
This points to the pipeline component failing to execute (i.e. the Task of the component Failed)
Can you send the log of that Task?
Okay, now I'm lost, is this reproducible? Are you saying Dataset with remote links to S3 does not work?
Did you provide credentials to your S3 (in your clearml.conf)?
I mean just add the toy tqdm loop somewhere just before starting the lightning train function. I just want to verify that it works, or maybe there is something in the specific setup happening in real-time that changes it
Hi GhastlySeaurchin98
During our first large hyperparameter run, we have noticed that there are some tasks that get aborted with the following console log:
This looks like the HPO algorithm doing early stopping, which algo are you using?
For example, opening a project or experiment page might take half a minute.
This implies a MongoDB performance issue
What's the size of the MongoDB?
If I access the dataset on the same location directly it works fine:
Wait, I'm confused, how is it that the dataset is there? Did it download the dataset?
are you saying this line for example will fail? (assuming you actually have a dataset by that name)
data_path = Dataset.get(dataset_name="002_Datenset_MASAM_for_fintuning", alias="002_Datenset_MASAM_for_fintuning").get_local_copy()
Ohh SubstantialElk6 please use agent RC3, (latest RC is somewhat broken sorry, we will pull it out)
Hi JealousParrot68
spinning the clearml-agent with docker support (i.e. each experiment is running inside its own container):
https://clear.ml/docs/latest/docs/clearml_agent#docker-mode
Basically you can specify a default docker to use (per agent) and a specific docker container to use per Task (configured in the UI under execution at the bottom)
Very Cool!
BTW guys, are you using the task.models[] to continue from the last checkpoint? or is it task.artifacts[]?
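For reference, a sketch of what the task.models[] route could look like. This is an assumption for illustration, not a statement of how anyone in the thread does it: the helper name last_checkpoint_path is made up, and it assumes the last entry in task.models["output"] is the most recent checkpoint.

```python
from typing import Any, Optional


def last_checkpoint_path(task: Any) -> Optional[str]:
    """Return a local copy of the last output model registered on the Task, if any."""
    output_models = task.models["output"]
    if len(output_models) == 0:
        return None
    # Assumption: the last registered output model is the latest checkpoint.
    return output_models[-1].get_local_copy()
```

With a real Task you would fetch it first, e.g. with Task.get_task(task_id=...), and load the returned file with your framework's own loader.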