Hmm, it should not make a difference.
Could you verify it still doesn't work with TF 2.4?
I think you are correct 🙂 Let me make sure we add that (docstring and documentation)
hmmm, somehow I have a bad feeling about it... Could you check the log, it should say something like "Collecting torch==1.6.0.dev20200421+cu101 from https://"
It should be right at the top of the installation. What do you have there?
Hi EagerOtter28
I think the replacement should happen here:
https://github.com/allegroai/clearml-agent/blob/42606d9247afbbd510dc93eeee966ddf34bb0312/clearml_agent/helper/repo.py#L277
Hi SuperficialGrasshopper36
You are definitely onto a bug 🙂
It seems that with the new poetry, we fail to set the target venv (basically it decides for itself); from that point, the execution of the actual code is not running inside the correct venv.
Could you please open a GitHub issue?
I want to make sure this will be addressed 🙂
Hi ExcitedFish86
Good question, how do you "connect" the 3 nodes? (i.e. what is the framework you are using?)
Hi DilapidatedDucks58
trains-agent tries to resolve the torch package based on the specific cuda version inside the docker (or on the host machine if used in virtual-env mode). It seems to fail finding the specific version "torch==1.6.0.dev20200421+cu101"
I assume this version was automatically detected by trains when running manually. If this version came from a private artifactory, you can add it to the trains.conf https://github.com/allegroai/trains-agent/blob/master/docs/trains.conf#L...
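For reference, a minimal sketch of the trains.conf entry this refers to, assuming the wheel is hosted on a private package index (the index URL below is a placeholder):

```
# trains.conf sketch: let the agent search an additional pip repository
agent {
    package_manager {
        # extra index the agent will query when resolving packages
        extra_index_url: ["https://my.private.pypi/simple"]
    }
}
```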
Ohh, I see now, yes that should be fixed as well 🙂
JitteryCoyote63 nice hack 🙂
how come it is not automatically logged as console output?
Hmm let me rerun (offline mode, right?)
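As a side note, a minimal sketch of an offline run, assuming the standard offline-mode API (project/task names and paths are placeholders):

```python
from clearml import Task

# nothing is sent to the server; everything is written to a local session folder
Task.set_offline(offline_mode=True)

task = Task.init(project_name="examples", task_name="offline-repro")
task.get_logger().report_text("this goes into the offline session")
task.close()

# later, the zipped session folder can be imported back into the server:
# Task.import_offline_session("/path/to/offline_session.zip")
```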
Any plans to add an unpublished state for clearml-serving?
Hmm OddShrimp85 do you mean like a flag, not being served?
Should we use archive?
The publish state basically locks the Task/Model so they are not to be changed; should we enable unlocking (i.e. un-publish), wdyt?
Yes, the only requirement is that task.execute_remotely() is the last call, because it will literally stop the local run before you add the Args section.
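To make the ordering concrete, a minimal sketch (project/queue names, the argument dict and the train function are placeholders):

```python
from clearml import Task

task = Task.init(project_name="examples", task_name="remote-run")

# connect hyper-parameters first, so they are registered on the task
args = {"epochs": 10, "lr": 0.001}
task.connect(args)

# only then hand the run over to an agent; locally this call stops execution here
task.execute_remotely(queue_name="default")

# everything below only runs on the agent
train(args)  # hypothetical training function
```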
Thanks BroadSeaturtle49
I think I was able to locate the issue: != breaks the pytorch lookup.
I will make sure we fix it asap and release an RC.
BTW: how come 0.13.x has no linux x64 support? And the same for 0.12.x
https://download.pytorch.org/whl/cu111/torch_stable.html
BTW: you can always set a different config file with an environment variable: CLEARML_CONFIG_FILE="path/to/config/file"
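A minimal sketch of using it from Python, assuming the variable is set before clearml is imported (paths and names are placeholders); normally you would just export it in the shell:

```python
import os

# point ClearML at an alternate configuration file;
# this must happen before the clearml package is imported
os.environ["CLEARML_CONFIG_FILE"] = "/path/to/alternate_clearml.conf"

from clearml import Task

task = Task.init(project_name="examples", task_name="alt-config")
```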
Hmm can you try:
--args overrides="['log.clearml=True','train.epochs=200','clearml.save=True']"
neat! please update us on your progress, maybe we should add an upgrade section once you have the details worked out
it seems like each task is set up to run on a single pod/node based on attributes like gpu memory, os, num of cores, worker
BoredHedgehog47 of course you can scale on multiple nodes.
The way to do that is to create a k8s YAML with replicas; each pod is actually running the exact same code with the exact same setup. Notice that inside the code itself the DL frameworks need to be able to communicate with one another and b...
ElegantCoyote26
parser = get_parser()
args_ = vars(parser.parse_args())
task.connect(args_)
There is no need to connect args_ ; Task.init will automatically catch the argparser.
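A minimal sketch of what that simplifies to, assuming Task.init is called before parsing (project/task names and the --lr flag are placeholders):

```python
import argparse
from clearml import Task

# Task.init hooks argparse, so parsed arguments are logged automatically
task = Task.init(project_name="examples", task_name="argparse-autolog")

parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=0.01)
args = parser.parse_args()  # captured by ClearML, no task.connect() needed
```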
OK, I got it by modifying the .conf file and putting the credentials on the node
Nice! 🙂
Yes, I do have a GOOGLE_APPLICATION_CREDENTIALS environment variable set, but nowhere do we save anything to GCS. The only usage is in the code, which reads from BigQuery
Are you certain you have no artifacts on GS?
Are you saying that if GOOGLE_APPLICATION_CREDENTIALS is set and clearml.conf contains no "project" section, it crashes when starting?
from clearml import Task

task = Task.get_task('task_id_here')
task.mark_started(force=True)  # re-open the completed task so it can be modified
task.upload_artifact(..., wait_on_upload=True)  # block until the upload finishes
task.mark_completed()  # close it again
I am trying to see if the user can submit a list of resource requirements (e.g. 4 GPUs, 12 cores, 100GB disk space) when queuing a task, and have the agents pick up these tasks if they have the requested resources. With this, the user need not think about which queue to send the task to. The users just state what they need and the agents do the scheduling for them.
Can I assume we are talking Kubernetes under the hood for the resource allocation?
That would match what add_dataset_trigger and add_model_trigger already have, so it would be good
Sounds good, any chance you can open a github issue, so that we do not forget?
Another parameter for when the task is deleted might also be useful
That actually might be more complicated, because there might be a race condition, basically missing the delete operation...
What would be the use case?
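For context, a rough sketch of how the existing triggers are wired up with TriggerScheduler (parameter names are from memory and may differ slightly; project names and the callback are placeholders):

```python
from clearml.automation import TriggerScheduler

scheduler = TriggerScheduler(pooling_frequency_minutes=3)

def on_model_published(model_id):
    # placeholder callback: kick off follow-up work for the new model
    print(f"model {model_id} was published")

scheduler.add_model_trigger(
    name="model-published-trigger",        # assumed parameter name
    schedule_function=on_model_published,  # called with the triggering object id
    trigger_project="examples",            # only watch this project
    trigger_on_publish=True,               # fire when a model is published
)

scheduler.start()  # blocks and polls the server for trigger events
```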
I might have an idea, could you test with:
```
from clearml import Task
Task._report_subprocess_enabled = False
...
real code here
```
This seems to only work for a single file (weights_path implies a single file, not multiple ones). Is that the case?
See update_weights_package: it actually packages an entire folder as a zip and will do the extraction when you get it back (check the function docstring, I think you can also specify wildcards etc. if needed)
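A rough sketch of that flow, assuming the OutputModel/InputModel calls behave as described above (paths, names and the model id are placeholders; check the docstrings for exact signatures):

```python
from clearml import Task, OutputModel, InputModel

task = Task.init(project_name="examples", task_name="weights-package")

# package an entire local folder (zipped under the hood) as the model weights
output_model = OutputModel(task=task)
output_model.update_weights_package(weights_path="checkpoints/")

# later, in another process: fetch the package and extract it locally
# (assumed retrieval call, see the docstring)
input_model = InputModel(model_id="model_id_here")
local_files = input_model.get_weights_package()
```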
Why do you see this as preferred to the dataset method we have now?
So it addresses a few of the requirements that you raised
It is fully visible as part of the project and se...
Hmm I think you have a point here, the confusing part is the cp cmd. Can you send the full log? (Regardless, can I assume you are running a rootless container?)