Why would restoring a TensorFlow checkpoint run out of memory when the original script didn't?

I am running some TensorFlow code that restores and restarts training from a checkpoint. Whenever I restore with the CPU build it works perfectly fine, but if I try to restore when running my code on the GPU it doesn't work. In particular, I get the error:

Traceback (most recent call last):
  File "/home_simulation_research/hbf_tensorflow_code/tf_experiments_scripts/batch_main.py", line 482, in <module>
    large_main_hp.main_large_hp_ckpt(arg)
  File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_large_hp_checkpointer.py", line 212, in main_large_hp_ckpt
    run_hyperparam_search(arg)
  File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_large_hp_checkpointer.py", line 231, in run_hyperparam_search
    main_hp.main_hp(arg)
  File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_hp.py", line 258, in main_hp
    with tf.Session(graph=graph) as sess:
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1186, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 551, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/usr/lib/python3.4/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.
E tensorflow/core/common_runtime/direct_session.cc:135] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 18446744073709551615

I see that it says I am running out of memory, but when I increase the memory to, say, 10 GB it doesn't really change anything. This only happens with my GPU build; the CPU one restores perfectly fine.

Does anyone have any ideas, or starting points, for what might be causing this?

The GPUs are assigned automatically, so I'm not quite sure what might be causing it or what the first steps to debug this would even be.
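
For what it's worth, the session is created with a plain tf.Session(graph=graph), as shown in the traceback, so no GPU options are set anywhere. A minimal sketch of the TF 1.x GPU options I could experiment with as a starting point; allow_growth and the 0.5 fraction below are illustrative choices, not values from my actual code:

import tensorflow as tf

graph = tf.Graph()  # stand-in for the graph my code actually builds and restores into

# By default the GPU build reserves nearly all of the card's memory up front.
# These options make TensorFlow allocate lazily and cap the process instead.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True                    # grow allocations as needed
config.gpu_options.per_process_gpu_memory_fraction = 0.5  # illustrative cap

with tf.Session(graph=graph, config=config) as sess:
    # ... restore the checkpoint and resume training here ...
    pass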

Answer 1

TensorFlow's CPU build benefits from both physical and virtual memory, giving you almost unlimited memory to manipulate your models; the GPU build is limited to the memory on the card. As a first debugging step, build a smaller model by simply removing some weights/layers and run it on the GPU to ensure you have no coding errors. Then slowly increase the layers/weights until you run out of memory again. This will confirm that you genuinely have a memory issue on the GPU.

I would recommend building your graph on the GPU from the start; that way you know it will fit later when you train on it. If you need the large graph, then consider allocating parts of the graph to different GPUs if you have them, as in the sketch below.
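
That last suggestion, splitting the graph across cards, is done with explicit device placement. A minimal sketch assuming two visible GPUs; the layer shapes here are made up for illustration:

import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    x = tf.placeholder(tf.float32, [None, 784], name='x')

    # Pin the first layer's variables and ops to GPU 0.
    with tf.device('/gpu:0'):
        w1 = tf.Variable(tf.truncated_normal([784, 512], stddev=0.1))
        h1 = tf.nn.relu(tf.matmul(x, w1))

    # Pin the second layer to GPU 1, so its weights live on a different card.
    with tf.device('/gpu:1'):
        w2 = tf.Variable(tf.truncated_normal([512, 10], stddev=0.1))
        logits = tf.matmul(h1, w2)

# allow_soft_placement lets TensorFlow fall back to another device when an op
# has no kernel for the requested GPU, instead of raising an error.
config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(graph=graph, config=config) as sess:
    sess.run(tf.global_variables_initializer())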
