Difference between revisions of "Troubleshooting"

From Ancient Anguish Mud Wiki - AAwiki

Revision as of 06:52, 27 July 2013

Troubleshooting your AA connection

Below are some typical symptoms you might run into, and some possible causes and solutions.

There are two kinds of reasons you might not be able to play on AA:

  1. The AA server process is not talking to you.
  2. Your machine and AA's machine cannot connect over the Internet.

The server to connect to is ancient.anguish.org, port 2222. The direct IP address is 67.203.2.146 (in the year 2013 at least). You can telnet to it directly, or use a client on the AA pages. The testmud is at port 3333, and the FTP port for wizzes is 3001 (2007); I recall this changed a couple of years ago. The AA website is at http://anguish.org (you can add www or anguish at the start if you like; they all go to the same place).

This document refers to pinging and tracerouting several times. Ping sends a "ping!" to a server, waits for a response, and tells you how long it took. Traceroute sends a series of pings to figure out what route your packets take to get to the destination. A Windows user may have to download a tool to do this, but most operating system packages come with basic networking tools like these. If you need to get some for yourself, do edit this page or the discussion side to comment on what you did.

Connection works, but is laggy or works only in bursts, then kicks out

This is the typical problem you will run into with any real-time interactive Internet service. Check how widespread the problem is by e.g. connecting to various websites. Try first a website near you (e.g. neighbouring university, your service provider's homepage - usually close in network terms but not always, or your local Google variant), then a website further away (other side of the continent, like a further-away university, and on another continent - note that e.g. Google's routing might be too clever for you and redirect google.jp to a server right next door). E.g. the University of Tokyo is rather far from Finland, and the traffic passes through New York when coming from here. The University of Helsinki servers are reasonably far from residents in the States, but then again, as long as AA is on the same continent, you do not really need to test if you can get to Europe. ;)

  • If you get slow response from everywhere, e.g. various unrelated websites as well, it is probably your Internet connection that is clogged up.
    • Possible solution: Shut down processes that might eat up bandwidth (peer-to-peer software, big uploads or downloads). If this does not work, you can contact your ISP's support.
  • If you get slow response from some sites but not others (particularly close by ones), third party routers between you and AA may be lagged. You can try traceroute to see whereabouts the problem lies; probably the only solution is to wait it out.
  • If you get slow response from AA only, it may be that the AA ISP's connection is stuffed. The AA connection is not used for downloading latest movies, so it's not likely to just run out. However, there's an off chance that someone's running a (distributed) denial of service attack and trying to swamp the connection - either to AA directly or to the ISP generally.
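The near/far/AA comparison above can also be done outside the browser. What follows is only a rough sketch, not a replacement for ping: it times how long a plain TCP connection takes to open, which is a reasonable stand-in because real ping (ICMP) usually needs extra privileges. The function name `time_connect` and the default timeout are made up for this illustration.

```python
import socket
import time


def time_connect(host, port, timeout=5.0):
    """Return the seconds needed to open a TCP connection, or None on failure.

    This is a stand-in for ping, not the same thing: it measures a TCP
    handshake rather than an ICMP echo.
    """
    start = time.monotonic()
    try:
        # create_connection resolves the name and performs the handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return None
```

Calling e.g. `time_connect("www.google.com", 80)` and `time_connect("ancient.anguish.org", 2222)` a few times each and comparing the numbers gives a feel for whether the lag is everywhere or only towards AA.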

Did it work before?

The troubleshooting instructions here assume that you are connecting to the right machine at the right port, and that you are yourself connected to the Internet. That is, if your connection has worked before and you have not changed anything, feel free to scroll down.

If you are setting up a connection for the first time on a new mud client, for example, first check for the obvious issues:

  • Are you connecting to either ancient.anguish.org or anguish.org?
  • Are you connecting to port 2222 (or 3333 for the testmud)?

How to check these varies depending on your mud client. You can always try the Java web client for this, since it's relatively likely to be correctly set up.
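If you would rather test the host and port without any mud client in between, a small script can do it. This is a minimal sketch, assuming Python 3 is available; the function name `check_connection` and the result strings are invented here, chosen to match the symptom headings used on this page.

```python
import socket


def check_connection(host, port, timeout=10.0):
    """Try one TCP connection and describe the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.gaierror:
        # The name server could not turn the host name into an address.
        return "name not known"
    except ConnectionRefusedError:
        # A machine answered, but nothing is listening on that port.
        return "connection refused"
    except socket.timeout:
        # Nothing came back before the timeout ran out.
        return "no response"
    except OSError as exc:
        # Anything else, e.g. "network unreachable".
        return f"other error: {exc}"
```

For example, `check_connection("ancient.anguish.org", 2222)` should give "connected" when everything is fine; the other results point to the matching sections below.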

Are you online and not firewall-blocked?

To check that you are connected to the Internet, you can e.g. fire up your browser and

  • Open a website or a few that are unlikely to be actually down, e.g. http://www.google.com/ and your service provider's homepage.
  • Make sure that you are not just talking to a web proxy by e.g. making a query you haven't done before on Google, or otherwise going to a page you haven't visited before/in ages.
  • If these work, try opening the AA website: http://anguish.org/

If the other websites or some of them do not open, you're probably not too well connected to the Internet. It's possible that your connection to your service provider works fine, but you cannot get connected abroad or otherwise further away. This kind of problem you will probably just have to wait out; you can try to traceroute your packets to see where the problem is exactly: the last network addresses you see are the troubled ones, and your packets may either be lost to stuffed routers or end up in an eternal loop. If the AA website is practically the only one causing trouble, then scroll down for further troubleshooting.

Most firewalls allow web traffic (goes through port 80, possibly through a proxy), but the AA port 2222 may be blocked. It's not very straightforward to test for this; it's possible (Fir is not too well-informed on this) that the AA Java web client gets past the firewall, so you can try whether that works. If you suspect you might be behind a firewall, you'll have to ask your system administrator (or boss ;)) about that.

"Connection refused" when trying to connect

This means that your packets made it through, and there is a machine there responding "I don't wanna". So it's a problem of type 1: The AA server process not talking to you.

First, it's quite possible that AA is just rebooting. It does this every two days or so (May 2007 - longer reboots are possible in the future). To check this, go to the AA website and check the right-hand corner for when the next reboot is scheduled. The AA process rebooting does not affect the website.

  • Possible solution: Wait for a minute, it'll come up by itself. Then retry.

If the server isn't coming up, you can try to mail an admin (i.e. arch or senator) who'd be likely to be awake at the hour. username@anguish.org addresses work for most. See also Current Arches and Members of Committee.

You can also check whether you can connect to the testmud, even if you cannot log on there. If the login screen shows up ok, then there's something funky going on in the 2222 side only. Typically the testmud and normal sides reboot at pretty much the same schedule, but 2222 could probably also crash in a way that doesn't bring down the testmud.

"Name or service not known" when trying to connect

It's likely that your local name server has no idea where you are trying to head. A name server is like a phone book service: you tell it you want to talk to anguish.org, and it tells you that the IP address number you want is 67.203.2.146 at the moment. If your local name server is truly confused, you should also not be able to connect to e.g. http://www.google.jp or another site you haven't visited before; your machine might have locally memorized the address of a site you've just visited a moment earlier, however.

  • Possible solution: Check that you have not misspelled the server name, or try to connect to the raw IP address directly.
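The phone book lookup described above can be tried by hand. A minimal sketch, assuming Python 3; the helper name `lookup` is made up for this example, and it simply asks whatever name server your machine is configured to use.

```python
import socket


def lookup(name):
    """Ask the local resolver for the IPv4 address of a host name.

    Returns the address as a string, or None if the name is not known.
    """
    try:
        return socket.gethostbyname(name)
    except socket.gaierror:
        # This is the "Name or service not known" situation.
        return None
```

If `lookup("anguish.org")` returns None while the raw IP address still connects, the trouble is with name resolution rather than with AA itself. The address it returns may differ from the one quoted on this page if the server has moved since.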

No response at all when trying to connect

If the AA machine is down, you will get no response to your connection attempt, and no response to ping. You might get a "timed out" error message from your client, though - it just means it did not hear back and gave up.

There are two reasons why you might not get a response:

  • The AA machine is down, or
  • Some router on your way to the AA machine is down - this includes firewalls that pretend the world is down.

Use traceroute to check which one is the case. Currently (May 2007), we are sitting right next to Sprintlink's Chicago routers, and the last two hops before anguish.org should probably be sl-gw31-chi-*.sprintlink.net (probably * = 10-0) and sl-local-1-0.sprintlink.net. If you do not get to these hops and especially if you get stuck before you hit sprintlink.net routers at all, it's not an AA connection problem - there's a bigger blackout going on. If you never get further than a hop or two, you may also be firewalled.

Assuming that the problem is AA's machine, make sure the entire machine is indeed down by trying to visit the website - it should not come up. If the website works, you're probably suffering from a firewall problem.

If the AA machine has indeed crashed, you can try to mail an admin (see "Connection refused" above) of the Chicago area to go see what's going on - or an admin who'll be likely to have the phone number of someone who can go see. The machine might also reboot by itself (this is different from just the AA server process "rebooting" through Armageddon) and be up in a jiffy.

AA isn't on a machine that goes about rebooting itself regularly though, and the most likely reason for the entire machine to crash is a power outage - check for news of storms in the Chicago area.

Can see what's going on, but can't send commands

If you can see what is going on in just about real time, but never see your commands going through, the likely cause is that your outgoing connection is overloaded:

  • Possible solution: If you are running peer-to-peer or other software that sends lots of data out (e.g. uploading large files to a server, sending an email with really big attachments), limit the outgoing bandwidth it can use or stop it altogether.

A typical network connection has less capacity for sending stuff out than for receiving it, because an average home user simply takes in much more than sends out (e.g. out: request for web page and its contents, in: a megabyte of images attached to the page). If you use up most of the outbound capacity, your command packets to AA have trouble getting there.

I've run into a special case of this problem shutting down the entire connection. A network connection through cable and some thoroughly unclever routing was involved. The connection got so choked on the outgoing side that the magical "ack: I've received packet number x" packets did not make it through anymore. This meant that the incoming traffic consisted of various servers patiently trying to resend the packets that were already received instead of new stuff, because they never heard what was going on. As a result, no useful traffic was moving either way; it was easily solved by throttling traffic down.

(The asymmetry problem was brought up on the adv. board in February 2007.)

The Sprintlink router loop of death

AA has been residing around the Chicago area in 2007, with an Internet service provider (ISP) connecting directly to Sprintlink routers. In December 2006 and May 2007, the Sprintlink router loop of death has been kicking people off AA; I'm documenting the symptoms here so that you can be sure when it's not caused by this effect.

Note that Sprintlink is a part of the Internet backbone in certain parts of the world, and the main way available to us to avoid this problem would be to have the server somewhere far away from it. A previous server location at Inreach was apparently not directly connected to Sprintlink. Back in the day before that, when we were close to Sprintlink again, there was a concept of "Sprintlink lag" which referred to Sprintlink's routers being overloaded and responding sluggishly - this slowed down the connection.

The main symptom is that if you try to ping AA, you get "time to live exceeded" responses - the packets don't go through because the route they try to travel is longer than is reasonable. This is caused by a loop. Your traceroute might look like this (pay particular attention to the end parts; the start depends a lot on where you're coming from):

traceroute to ancient.anguish.org (208.20.1.214), 30 hops max, 40 byte packets
 1  nblzone-241-gw.nblnetworks.fi (83.145.241.254)  4.251 ms  4.060 ms  4.105 ms
 2  r1-ge1.hki.nbl.fi (217.30.183.140)  12.676 ms  4.427 ms  4.058 ms
 3  r4-ge1.hki.nbl.fi (80.81.160.233)  45.539 ms  4.312 ms  4.644 ms
 4  ge-0-0-0.se-sthms001-pe-1.tu.telenor.net (212.105.101.198)  12.366 ms  12.231 ms  12.285 ms
 5  213.242.110.1 (213.242.110.1)  15.891 ms  12.299 ms  29.597 ms
 6  ge-0-0-0.mp1.Stockholm1.Level3.net (4.68.96.221)  11.778 ms ge-2-0-0.mp1.Stockholm1.Level3.net (4.68.125.217)   12.200 ms  16.295 ms
 7  ae-1-0.bbr1.London1.Level3.net (212.187.128.58)  63.034 ms as-2-0.bbr2.London1.Level3.net (4.68.128.213)  61.516 ms ae-1-0.bbr1.London1.Level3.net (212.187.128.58)  61.802 ms
 8  ae-21-56.car1.London1.Level3.net (4.68.116.175)  62.178 ms ae-11-53.car1.London1.Level3.net (4.68.116.79)  62.362 ms ae-21-54.car1.London1.Level3.net (4.68.116.111)  62.477 ms
 9  sl-bb21-lon-10-0-0.sprintlink.net (213.206.131.21)  50.174 ms  50.908 ms  50.510 ms
10  sl-bb22-lon-3-0.sprintlink.net (213.206.129.153)  51.001 ms  50.625 ms  50.784 ms
11  sl-bb20-nyc-2-0.sprintlink.net (144.232.9.163)  114.044 ms  114.219 ms  114.728 ms
12  sl-bb22-nyc-8-0.sprintlink.net (144.232.7.106)  114.343 ms  114.246 ms  115.028 ms
13  sl-bb21-chi-9-0.sprintlink.net (144.232.9.149)  135.976 ms  136.444 ms  136.215 ms
14  sl-gw31-chi-10-0.sprintlink.net (144.232.26.30)  136.741 ms  135.625 ms  136.007 ms
15  sl-local-1-0.sprintlink.net (144.223.21.222)  182.631 ms  175.784 ms  163.858 ms
16  sl-gw31-chi-5-0-21-TS0.sprintlink.net (144.223.21.221)  166.469 ms  167.221 ms  194.358 ms
17  sl-local-1-0.sprintlink.net (144.223.21.222)  212.078 ms  214.740 ms  217.458 ms
18  sl-gw31-chi-5-0-21-TS0.sprintlink.net (144.223.21.221)  197.881 ms  189.437 ms  196.822 ms
19  sl-local-1-0.sprintlink.net (144.223.21.222)  232.501 ms  234.763 ms  227.377 ms
20  sl-gw31-chi-5-0-21-TS0.sprintlink.net (144.223.21.221)  225.450 ms  230.116 ms  222.850 ms

The loop starts at hop 14. Normally, the end looks like this:

13  sl-bb21-chi-3-0-0.sprintlink.net (144.232.20.102)  136.452 ms  135.958 ms  136.245 ms
14  sl-gw31-chi-10-0.sprintlink.net (144.232.26.30)  135.736 ms  136.121 ms  136.249 ms
15  sl-local-1-0.sprintlink.net (144.223.21.222)  157.928 ms  155.738 ms  155.607 ms
16  anguish.org (208.20.1.214)  156.420 ms  154.941 ms  155.209 ms

That is, the packets travel from the sl-local-1-0 to anguish.org, and everyone's happy. But here, there's some reason why sl-local-1-0 thinks it's not the closest router to AA, but figures sl-gw31-chi-5-0-21-TS0 is closer; the chi-5 one again thinks that sl-local-1-0 is closer, and returns the packet there. This can be caused by two kinds of problems:

  • The connection between sl-local-1-0 and anguish.org is really broken, possibly due to our ISP messing up instead of Sprintlink. But sl-local tries to be clever: it has heard from chi-5 that it's got a better route, which happens to involve packets going through sl-local. Either one's not updating their routing table properly.
  • The chi-5 router is truly messed up, and advertises a route even shorter than sl-local-1-0's - probably one with 0 hops. Then its own notes are sane enough to actually try to forward the packets to sl-local as it should, but sl-local just sends them back. I doubt sl-local'd do this if it were completely sane either.
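Spotting such a loop by eye gets tedious in a long listing, so here is a small sketch that does it for you. It assumes traceroute output in the format shown above (hop number, host name, address in parentheses); the function name `find_routing_loop` and the regular expression are made up for this illustration, and other traceroute variants may print things differently.

```python
import re

# Hop number, host name, then the first address in parentheses on the line.
HOP_RE = re.compile(r"^\s*(\d+)\s+\S+\s+\((\d+\.\d+\.\d+\.\d+)\)")


def find_routing_loop(traceroute_output):
    """Return (first_hop, repeat_hop, address) for the first address that
    shows up at two different hops, or None if no address repeats."""
    first_seen = {}
    for line in traceroute_output.splitlines():
        match = HOP_RE.match(line)
        if not match:
            continue  # header line, or a hop that printed only "* * *"
        hop, address = int(match.group(1)), match.group(2)
        if address in first_seen:
            return first_seen[address], hop, address
        first_seen[address] = hop
    return None
```

Fed the looping listing above, this reports 144.223.21.222 (sl-local-1-0) at hops 15 and 17. It compares addresses only, so it cannot tell that hop 14 appears to be another interface of the same sl-gw31-chi router; it is a rough helper, not a diagnosis.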

This kind of trouble can be connected with Sprintlink upkeep. There were reports of a "crash" a couple of days before, but it's unclear if that was really just this same network problem. It seems Sprintlink's maintenance view's got two emergency upkeep events logged on 22nd-23rd May 2007 that happened in the Chicago area: http://www.sprintlink.net/maintview/. Neither of the affected machines is one I've seen in my traceroute, but it might well be related. Sprintlink instructs non-Sprintlink customers to send mail to noc@sprint.net for latency problems that have been identified to be in the Sprintlink network. Calling would be preferable to know if they do anything about it, but I'm (Fir) in Finland and their contact numbers don't seem to be designed for that: http://www.sprint.com/contactus/

All this put with less technobabble: our faithful Internet Protocol packets containing our chats and smites get hijacked on their way to AA, and then juggled back and forth between two routers who can't decide which one is more confused. This alone doesn't indicate that AA's crashed; it might be living happily waiting for our connections, and our gear might be sitting on our linkdead statues. Some happy AAers might even be mudding right now and be blissfully ignorant of our problem.