Hmm what do you have here?
os.system("cat /var/log/studio/kernel_gateway.log")
This is very odd ... let me check something
This is strange, let me see if we can get around it, because I'm sure it worked 🙂
Which works for my purposes. Not sure if there's a good way to automate it
Interesting, so if we bind to hydra.compose it should solve the issue (and of course verify we are running in a Jupyter notebook)
wdyt?
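Something like this minimal sketch, perhaps (assuming the idea is to wrap hydra.compose and only activate it inside a notebook; the wrapper and the check below are illustrative, not ClearML's actual integration):
```python
import hydra

def in_notebook() -> bool:
    # Heuristic: the ZMQ-based IPython shell means we are inside Jupyter
    try:
        from IPython import get_ipython
        return get_ipython().__class__.__name__ == "ZMQInteractiveShell"
    except Exception:
        return False

_original_compose = hydra.compose

def _wrapped_compose(*args, **kwargs):
    cfg = _original_compose(*args, **kwargs)
    # hook point: capture / log the composed config here
    return cfg

if in_notebook():
    hydra.compose = _wrapped_compose
```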
Hmm and how would you imagine a transparent integration here (the example looks like a lot of boilerplate code...)
Hmm, this is a good question. I "think" the easiest is to mount the .ssh folder from the host into the container itself, and then also mount clearml.conf into the container with force_git_ssh_protocol: true
see here
https://github.com/allegroai/clearml-agent/blob/6c5087e425bcc9911c78751e2a6ae3e1c0640180/docs/clearml.conf#L25
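Roughly something like this (paths and image name are placeholders):
```bash
# mount the host's SSH keys plus a clearml.conf that forces git-over-SSH
docker run \
  -v ~/.ssh:/root/.ssh:ro \
  -v ~/clearml.conf:/root/clearml.conf \
  my-image
```
with the mounted clearml.conf containing:
```
agent {
    # clone git repositories over SSH instead of HTTPS
    force_git_ssh_protocol: true
}
```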
btw: ssh credentials, even though they sound more secure, are usually less so (since they easily carry overly broad credentials and other access rights), just my 2 cents 🙂 I ...
Hi WackyRabbit7
the services container (or rather the agent running in it) is spinning up multiple Tasks (as opposed to a regular agent, which runs one task at a time).
how can I give this agent git access?
in the docker-compose you can configure the git credentials (user/pass or user/key, it works the same).
https://github.com/allegroai/clearml-server/blob/d0e2313a24eb1248ebf0ddf31bf589de0d675562/docker/docker-compose.yml#L137
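i.e. roughly this shape in the agent-services section (the values here are placeholders, normally injected from the host environment):
```yaml
agent-services:
  environment:
    CLEARML_AGENT_GIT_USER: my-git-user    # placeholder
    CLEARML_AGENT_GIT_PASS: my-git-token   # placeholder: password or personal access token
```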
Yes. Though again, just highlighting that the naming of foo-mod is arbitrary. The actual module simply has a folder structure with an implicit namespace:
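(a sketch of such a layout; the exact names are illustrative)
```
foo-mod/                  # arbitrary top-level folder / distribution name
└── foo/                  # implicit namespace package: note, no __init__.py at this level
    └── mod/
        ├── __init__.py
        └── ...
```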
Yep, I think this is exactly why it fails to detect it, let me check that
And it's failing on typing hints for functions passed in
pipe.add_function_step(…, helper_function=[…])
… I guess those aren't being removed like the wrapped function step?
Can you provide the log? I think I'm missing what e...
SmugOx94 Yes, we just introduced it 🙂 with 0.16.3
Discussion was here (I'll make sure to update the issue that the version is out)
https://github.com/allegroai/trains/issues/222
In your trains.conf add the following line:
```
sdk.development.store_code_diff_from_remote = true
```
It will store the diff from the remote HEAD instead of the local one.
GiganticTurtle0 is it in the same repository?
If it is, it should have detected that it needs to analyze the entire repository (not just the standalone script) and then discover tensorflow
Hi GiganticTurtle0
ClearML will only list the directly imported packages (not their requirements), meaning in your case it will only list "tf_funcs" (which you imported).
But I do not think there is a package named "tf_funcs", right?
It's just a custom module.
Is this your own module? Is this a local folder we import from?
Correct.
It starts with the initial script (entry point): if it is self-contained (i.e. does not interact with the rest of the repo) it will only analyze the script, otherwise it will analyze the entire repo code.
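As a tiny illustration (hypothetical files):
```python
# main.py (entry point)
import tf_funcs            # a local module in the same repo, not a PyPI package

# tf_funcs.py (elsewhere in the repo)
import tensorflow as tf    # only discovered if the entire repo is analyzed
```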
Hi DilapidatedDucks58
trains-agent tries to resolve the torch package based on the specific cuda version inside the docker (or on the host machine, if used in virtual-env mode). It seems to fail finding the specific version "torch==1.6.0.dev20200421+cu101"
I assume this version was automatically detected by trains when running manually. If this version came from a private artifactory you can add it to the trains.conf https://github.com/allegroai/trains-agent/blob/master/docs/trains.conf#L...
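For a private index that would look roughly like this in trains.conf (the URL is a placeholder):
```
agent {
    package_manager {
        # extra pip repositories to search when resolving packages
        extra_index_url: ["https://artifactory.example.com/api/pypi/pypi-local/simple"]
    }
}
```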
In theory task.tags.remove(tag) might also work, but I'm not sure if it will automatically be updated on the backend
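If it does not, an explicit round-trip should (a sketch, assuming get_tags() / set_tags() sync with the server):
```python
from clearml import Task

task = Task.current_task()
tags = list(task.get_tags())     # current tag list from the task
if "my-tag" in tags:             # "my-tag" is a placeholder
    tags.remove("my-tag")
task.set_tags(tags)              # set_tags() pushes the updated list to the backend
```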
Yes, I think you are correct, verified on Firefox & Chrome. I'll make sure to pass it along.
Thanks SteadyFox10 !
@<1533620191232004096:profile|NuttyLobster9> I think we found the issue: when you are passing a direct link to the python venv, the agent fails to detect the python version, and since the python version is required for fetching the correct torch, it fails to install it. This is why passing CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE=none works, because it skips resolving the torch / cuda version (which requires parsing the python version)
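i.e. launching the agent with something along the lines of:
```bash
# skip torch/cuda-specific package resolution entirely
CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE=none clearml-agent daemon --queue default
```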
Hi WackyRabbit7
So I'm assuming this is after start_locally is called?
Which clearml version are you using?
(Just making sure: calling Task.current_task() before starting the pipeline returns the correct Task?)
Exactly!
it seems like each task is set up to run on a single pod/node based on attributes like gpu memory, os, num of cores, worker
BoredHedgehog47 of course you can scale to multiple nodes.
The way to do that is to create a k8s YAML with replicas; each pod runs the exact same code with the exact same setup. Notice that inside the code itself the DL frameworks need to be able to communicate with one another and b...
Actually this is the default for any multi-node training framework (torch DDP / OpenMPI etc.).
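A bare-bones sketch of the replicas idea (not a complete DDP setup; the image and env values are placeholders):
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: multi-node-train
spec:
  replicas: 4                      # one identical pod per node
  serviceName: multi-node-train    # headless service so pods can reach each other
  selector:
    matchLabels:
      app: multi-node-train
  template:
    metadata:
      labels:
        app: multi-node-train
    spec:
      containers:
        - name: trainer
          image: my-training-image:latest   # placeholder
          env:
            - name: WORLD_SIZE              # e.g. consumed by torch DDP init
              value: "4"
```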
Thanks ShallowCat10!
I'll make sure we fix it 🙂
Hi @<1556812486840160256:profile|SuccessfulRaven86>
Please notice that clearml-serving is not designed for public exposure: it lacks a security layer and is designed for easy internal deployment. If you feel you need the extra security layer, I suggest either adding external JWT-like authentication, or talking to the ClearML people; their paid tiers include enterprise-grade security on top
Well I guess you can say this is definitely not a self-explanatory line 🙂
but it is actually asking whether we should extract the code, think of it as:
```python
if extract_archive and cached_file:
    return cls._extract_to_cache(cached_file, name)
```
are you referring to the same line? 47 in cache.py?
Legit: if you have a cached_file (i.e. it exists and is accessible), you can return it to the caller
We should probably change it so it is more human readable 🙂
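e.g. something like (just a sketch):
```python
should_extract = extract_archive and cached_file is not None
if should_extract:
    # unpack the cached archive and hand back the extracted path
    return cls._extract_to_cache(cached_file, name)
```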