From: Patrick J McNerthney
Subject: Re: [Fab-user] Trouble using fabric with EC2
Date: Mon, 14 Jun 2010 12:38:03 -1000
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.9) Gecko/20100423 Thunderbird/3.0.4

Matt,

I use Fabric to orchestrate my EC2 instances also. What I did, though, was to create a loop that tests for "ssh connectability" before I invoke Fabric scripts. Very roughly copying and pasting the code, it looks something like this:

    # The instance state is "running" before entering this loop.
    while True:
        time.sleep(1)
        self.update()  # This updates self.instance.state
        if self.instance.state != "running":
            raise Exception('Unexpected instance state "' + self.instance.state + '"')
        if self._test_ssh(False):
            break
    # Should be okay to run Fabric commands now.

    def _test_ssh(self, throw=True):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            if "port" in self.configuration:
                port = int(self.configuration["port"])
            else:
                port = 22
            sock.settimeout(1)
            sock.connect((self.address, port))
            return True
        except socket.timeout:
            if throw:
                raise
        except socket.error as e:
            # errno 111 is "connection refused" on Linux.
            if throw or e.errno != 111:
                raise
        finally:
            sock.close()
        return False

HTH,
Pat

On 06/14/2010 12:16 PM, Matt Calder wrote:

Patrick,

I thought you were on to something there, but alas no. I get the same using both DNS and IP. Both the errors without the fixes described, and correct connections with the fixes.

Matt

On Mon, Jun 14, 2010 at 5:25 PM, Patrick J McNerthney <address@hidden> wrote:

Matt,

Try eliminating the use of DNS, i.e. "ec2-174-129-96-241.compute-1.amazonaws.com", and instead connect directly to the IP address, i.e. 174.129.96.241, to see if that has something to do with it.

Pat

On 06/14/2010 11:16 AM, Matt Calder wrote:

All,

After much debugging I finally found a workaround. I'd like to explain what I did in the hopes that someone might see what the underlying problem is.

I don't think I made this point explicit in my previous emails, but I am using fabric as a library. For simplicity, say I have two functions, createInstance and runStuff. The createInstance function creates an ec2 instance (using boto) and waits for the instance's state to be "running". The runStuff function uses fabric to run code on the instance. So, my program looks like:

    createInstance()
    runStuff()

If I run it as is, I will get connection failures inside fabric/network.py: connect, either a socket error or a timeout. I know that ec2 instances can report their state as "running" but still not be ready to take connections. So I added a sleep to my program:

    createInstance()
    sleep(240)
    runStuff()

Now, four minutes may seem excessive, but with four minutes I still get connection errors. During my investigations, I made a few interesting observations. If I place a debugger breakpoint just after the sleep, I can break and resume, and I will not get connection errors. If during the sleep period I ssh into the instance from a terminal, I will not get connection errors, either in the terminal or in the program when the sleep passes (yes, really). Lastly, if I run just createInstance in one process and then afterwards run just runStuff in another, separate process, I do not get connection errors.

The workaround that I found was two-part. First, I removed the sleep(240). Instead, I placed a sleep of 20 seconds in paramiko/client.py, at the very beginning of Client.connect. Then I added logic to fabric/network.py connect to retry on timeouts and socket errors up to six times. With these changes, I often connect the first time (that would include one 20 second sleep), and if not, always the second time (in the ten or so runs I have done). Note that the connection errors are occurring prior to any ssh activities; the connection is just getting a socket to port 22 on the ec2 instance.

For the record, I am running Ubuntu 10.04; however, colleagues report the same errors on Windows and MacOS.

I hope someone can provide a reason for the behavior I have been seeing. I don't mind the workaround, but while it works, it is not based on any real understanding of what the problem is.

Matt

On Thu, Jun 10, 2010 at 8:57 PM, Patrick J McNerthney <address@hidden> wrote:

Try using the --disable-known-hosts command line option to see if it has something to do with a prior use of the same IP address.

On 06/10/2010 01:19 PM, Matt Calder wrote:

Jeff,

On Thu, Jun 10, 2010 at 6:54 PM, Jeff Forcier <address@hidden> wrote:

> Hi Matt,
>
> Paramiko doesn't have a connection cache that I'm aware of, but Fabric itself does. However, from your description it sounds like you are creating a new instance and then connecting to it, so I'm not sure why a cache would present a problem.

I'm fairly certain fabric's cache is empty, because the code goes into the network.py : connect function. The reason I suggested a "paramiko cache" is that, while it is true that just after an instance goes from "pending" to "running" there is a period when connections fail, that period is usually very brief (< 10 sec).
That is why I do a sleep(60) after the startup, to give time for that to settle.

> If you're rebooting a remote system or doing anything to alter the networking of an already-connected system, then you can force a reconnect by manipulating fabric.state.connections. For example, see what the (master-only) reboot() operation does:
> http://code.fabfile.org/repositories/entry/fabric/master/fabric/operations.py#L668

I will look at that.

> If the problem is as straightforward as it sounds, though, I'm honestly not sure what's up other than "possible Paramiko bug". Are you getting any prompts or anything when you connect to the new instance by hand?

I can log in by hand, completely and correctly, from a terminal. I can do this after the instance is started but before fabric's first run call. The funny thing is, if I do log in from a terminal, the fabric run command will work. So, a pseudo code timeline:

    # Version 1, this will fail, the run cannot connect to the instance.
    startInstance()
    sleep(60)
    run("ls")

    # Version 2, this will succeed in running "ls" on the instance.
    startInstance()
    sleep(60)  # During this sleep, using a terminal, I log into the instance.
    run("ls")

Another variation that works is:

    # Version 3, this also succeeds.
    startInstance()
    sleep(60)
    # <Debugger breakpoint here> Using debugger, look at variables (no changes), proceed
    run("ls")

It is the examples that work that shout out "threading error" or "caching error" to me.

> Another thing to try is to upgrade Paramiko to 1.7.6 if you're using the bundled 1.7.4.

I will try that. Thanks for taking the time to help!

Matt

> -Jeff

On Thu, Jun 10, 2010 at 5:38 PM, Matt Calder <address@hidden> wrote:

Bruno,

No, it is in a good group. I can log in using fabric if I restart it and the instance is already running. I can see that fabric is inside network.py trying to make the connection. I get one of two errors: either timeout or low level socket error. In debugging, I added retries to network.connect and it will fail repeatedly.
First it times out a few times, then gives the "low level socket" error. While it is doing that, I can ssh into it from a terminal. I wonder, does paramiko have a connection cache? Maybe it is not really retrying?

Thanks for any help.

Matt

On Thu, Jun 10, 2010 at 5:23 PM, Bruno Clermont <address@hidden> wrote:

Is your instance in a security group that allows your IP and the port you're trying to connect to? If it times out, it's probably blocked by Amazon firewalls.

On Thu, Jun 10, 2010 at 15:07, Matt Calder <address@hidden> wrote:

Hi,

I am having problems using fabric with EC2 instances. I am not entirely sure fabric is even the source of the problem, but I am hoping someone on this list can suggest a solution or a path to investigate.

Here is the problem. I start an EC2 instance using boto. I wait for the instance to report its state as "running". I wait an additional 60 seconds after that. Then I try to "run" things on the instance through fabric. At that point I get:

    address@hidden run: ls
    Fatal error: Timed out trying to connect to ec2-174-129-96-241.compute-1.amazonaws.com
    Aborting.

Now, the interesting thing is this. During that additional 60 second wait I can log into the instance from a separate terminal; moreover, when I do that separate login, the fabric login succeeds.

Obviously, there is not a lot to go on here, but I am not entirely sure what additional information would be helpful. If anyone has a suggestion of what I might try to do, I would greatly appreciate it.

Thanks,
Matt

_______________________________________________
Fab-user mailing list
address@hidden
http://lists.nongnu.org/mailman/listinfo/fab-user

--
Jeff Forcier
Unix sysadmin; Python/Ruby developer
http://bitprophet.org
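
For readers who want to try the "test for connectability, then retry" idea discussed in this thread without patching paramiko or fabric, here is a minimal standalone sketch. It is not the code anyone in the thread actually ran: the function name wait_for_port and its parameters are hypothetical, it is written for Python 3 rather than the Python 2 of the original snippets, and the defaults of six attempts and a 20 second delay are only borrowed from the numbers Matt mentions. It checks that a TCP socket to the SSH port can be opened, nothing more, and it does not explain the underlying behavior Matt observed.

```python
import socket
import time


def wait_for_port(host, port=22, attempts=6, delay=20.0, timeout=1.0):
    """Return True once a TCP connection to (host, port) succeeds.

    Tries up to `attempts` times, sleeping `delay` seconds after each
    failed attempt except the last. Returns False if every attempt fails.
    All names and defaults here are illustrative assumptions.
    """
    for attempt in range(attempts):
        try:
            # create_connection resolves the host and applies the timeout.
            sock = socket.create_connection((host, port), timeout=timeout)
        except OSError:
            # Covers timeouts, refused connections and DNS failures.
            if attempt < attempts - 1:
                time.sleep(delay)
        else:
            sock.close()
            return True
    return False
```

In a program shaped like the one described above, a call such as wait_for_port(address) would sit between createInstance() and runStuff(), with runStuff() only being called when it returns True.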