Periodic failures in kitchen-ec2 #10

Closed
juliandunn opened this issue Jul 4, 2013 · 9 comments

@juliandunn

Periodically I get failures starting up EC2 machines, like this:

borkbork:~/Dropbox/devel/github/juliandunn/java (travis-ci-demo)$ kitchen test oracle-7-fedora-18
-----> Starting Kitchen (v1.0.0.alpha.7)
-----> Cleaning up any prior instances of <oracle-7-fedora-18>
-----> Destroying <oracle-7-fedora-18>
       Finished destroying <oracle-7-fedora-18> (0m0.00s).
-----> Testing <oracle-7-fedora-18>
-----> Creating <oracle-7-fedora-18>
Called 'load_file' without the :safe option -- defaulting to safe mode.
       EC2 instance <i-511f7733> created.
.............       (server ready)
..       (ssh ready)

       Finished creating <oracle-7-fedora-18> (0m57.97s).
-----> Converging <oracle-7-fedora-18>
>>>>>> Converge failed on instance <oracle-7-fedora-18>.
>>>>>> Please see .kitchen/logs/oracle-7-fedora-18.log for more details
>>>>>> ------Exception-------
>>>>>> Class: Kitchen::ActionFailed
>>>>>> Message: ec2-user
>>>>>> ----------------------

However, the machine is actually created; if I immediately do "kitchen converge oracle-7-fedora-18", then kitchen successfully logs into the machine and starts converging.

Perhaps there's a race condition in here somewhere? Or kitchen is trying to connect to the SSH port even though it's really not quite ready?

@fnichol
Contributor

fnichol commented Jul 23, 2013

Good question, my money would be on a race condition as well. It's possible that the wait logic is returning just a little too quickly when it sees an open TCP socket. Do you see this using any other drivers? Or possibly even certain AMI images?
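To illustrate the difference between "port 22 accepts a TCP connection" and "sshd is actually answering", here is a minimal sketch that waits for the server's SSH identification line instead of stopping at an open socket. It is not the wait logic Test Kitchen uses; the helper names `ssh_banner_ready?` and `wait_for_ssh_banner`, and their default timeouts, are made up for this example.

```ruby
require "socket"
require "timeout"

# Returns true once the server on host:port sends an SSH identification
# line (one starting with "SSH-"), false if that does not happen within
# `timeout` seconds. Hypothetical helper, not part of Test Kitchen.
def ssh_banner_ready?(host, port = 22, timeout: 5)
  Timeout.timeout(timeout) do
    Socket.tcp(host, port, connect_timeout: timeout) do |sock|
      # A server may send other lines before its identification string,
      # so keep reading until one starts with "SSH-" or the peer closes.
      while (line = sock.gets)
        return true if line.start_with?("SSH-")
      end
      false
    end
  end
rescue Timeout::Error, SystemCallError, SocketError, IOError
  # Refused, reset, unreachable, or silent: treat all of these as "not ready".
  false
end

# Polls until the banner shows up or the attempts run out.
def wait_for_ssh_banner(host, port = 22, attempts: 20, delay: 3)
  attempts.times do
    return true if ssh_banner_ready?(host, port)
    sleep delay
  end
  false
end
```

A banner only shows that sshd is answering. It says nothing about whether the login user's keys are in place yet, so a login attempt right after the banner appears can still be rejected.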

@rayrod2030

I'm seeing the same issue using an Ubuntu 10.04 image (ami-1ab3ce73).

kitchen test 10
-----> Starting Kitchen (v1.0.0.alpha.7)
-----> Cleaning up any prior instances of <default-ubuntu-1004>
-----> Destroying <default-ubuntu-1004>
       Finished destroying <default-ubuntu-1004> (0m0.00s).
-----> Testing <default-ubuntu-1004>
-----> Creating <default-ubuntu-1004>
Called 'load_file' without the :safe option -- defaulting to safe mode.
       EC2 instance <i-7611ef17> created.
..............................       (server ready)
..       (ssh ready)

       Finished creating <default-ubuntu-1004> (0m55.63s).
-----> Converging <default-ubuntu-1004>
>>>>>> Converge failed on instance <default-ubuntu-1004>.
>>>>>> Please see .kitchen/logs/default-ubuntu-1004.log for more details
>>>>>> ------Exception-------
>>>>>> Class: Kitchen::ActionFailed
>>>>>> Message: connection closed by remote host
>>>>>> ----------------------

After that I can kitchen login 10 and kitchen converge 10 with no issues.

@jejohns

jejohns commented Aug 27, 2013

Ditto here with ami-1ebb2077 (12.04 LTS).

fnichol added a commit to test-kitchen/test-kitchen that referenced this issue Aug 29, 2013
This may help to deal with instances that show an open TCP socket on
port 22 but are not yet ready for an SSH client connection.

References test-kitchen/kitchen-ec2#10

@fnichol
Contributor

fnichol commented Aug 29, 2013

I'm hopeful that the above commit in Test Kitchen core will help us here. Will be in the next release of both gems.

@jejohns

jejohns commented Sep 10, 2013

So far so good with the updated Test Kitchen. :D Thanks!

@lancefrench

We ran into the failure described in the initial report (where the exception message is a username) because of a race condition in how SSH keys were populated on our AMI. It's important to verify that cloud-init is configured and working properly because, when it is, key population takes place before the SSH daemon is launched. (Check /etc/cloud/cloud.cfg.)
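As a rough aid for that check, the sketch below pulls the default login user out of the text of a cloud.cfg file. It assumes the `system_info` / `default_user` / `name` layout used by recent cloud-init releases; older releases and distribution-specific builds may lay the file out differently, so treat a `nil` result as "look at the file by hand". The helper name `cloud_init_default_user` is made up for this example.

```ruby
require "yaml"

# Returns the default login user declared in cloud.cfg text, or nil when
# the file does not use the system_info/default_user/name layout.
# Hypothetical helper; the key layout is an assumption about the AMI's
# cloud-init version.
def cloud_init_default_user(cfg_text)
  cfg = YAML.safe_load(cfg_text) || {}
  return nil unless cfg.is_a?(Hash)

  system_info = cfg["system_info"]
  default_user = system_info.is_a?(Hash) ? system_info["default_user"] : nil
  default_user.is_a?(Hash) ? default_user["name"] : nil
end
```

On an instance, this could be called as `cloud_init_default_user(File.read("/etc/cloud/cloud.cfg"))` and the result compared with the username the driver logs in as.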

In our case, on a CentOS AMI, the default cloud-init user was misconfigured and our root ssh key was only being populated by the S99local script. Because the key copy took place after sshd launched, kitchen-ec2 would bomb when trying to connect early.
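When the image itself cannot be fixed, one possible client-side workaround is to retry the first login a few times rather than failing on the first rejection. The sketch below is generic and is not how kitchen-ec2 or Test Kitchen handle it; `LoginNotReadyError` is a stand-in for whatever exception the SSH client raises when a login is refused, and `with_retries` is a made-up helper name.

```ruby
# Stand-in for the error an SSH client raises when a login is rejected.
# Hypothetical; substitute the real exception class of the client in use.
class LoginNotReadyError < StandardError; end

# Runs the block, retrying up to `attempts` times when one of the listed
# errors is raised and sleeping `delay` seconds between tries. The last
# failure is re-raised so the caller still sees the real error.
def with_retries(attempts: 5, delay: 2, on: [LoginNotReadyError])
  tries = 0
  begin
    tries += 1
    yield tries
  rescue *on
    raise if tries >= attempts
    sleep delay
    retry
  end
end
```

The block receives the current attempt number, which is handy for logging. Retrying only on the listed errors keeps unrelated failures, such as a wrong hostname, from being masked.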

@rayrod2030

Boom, this just fixed my tests as well. Thanks @fnichol!

@fnichol
Contributor

fnichol commented Nov 30, 2013

@lancefrench ah, that's pretty insightful and makes a ton of sense if you're baking your own AMIs.

@jejohns and @rayrod2030, thanks for confirming!

/me hopes we're all good here now. 🍰

fnichol closed this as completed Nov 30, 2013

@ekrupnik

@fnichol Which version of Test Kitchen is this in? I am running Test Kitchen version 1.0.0.beta.4 and still seeing the same issues that many users described above.
