Examples of recent, very successful troubleshooting engagements that produced fast results :
|Duration||Client||Situation and Outcome|
|5 days||State Government||A new application performed very badly when it was tested under the assumed maximum load.
Packet captures were taken on a user PC while various application functions were exercised. Simultaneously, packet captures were taken on the Oracle database server and on both sides of an intermediate Check Point firewall. This was done for both no-load and heavy-load situations.
Measurements of the packet transit times between each of the capture points showed that the performance degradation was entirely due to the firewall. Transit times through the firewall increased from the no-load sub-millisecond to an average 30 ms and maximum 80 ms (each way) when under load.
The firewall was upgraded in order to cope with the expected loads.
Also, a little known Check Point and Oracle performance issue was discovered and remediated.
|4 days||Federal Government||Performance of reports from a SAS system (with Teradata backend database) was extremely variable.
Packet captures were taken on a user PC while various reports were run.
The analysis revealed that the SAS system was retrieving large numbers of rows of data (up to 200,000) from the backend database – but was doing it just one row at a time.
The analysis also provided a clean bill of health for the WAN (which was initially strongly suspected by some of the support people).
|4 days||Airline||Performance of web browsing for international users at two sites was very poor. Stopwatch measurements taken when a PC used the official proxy servers (versus going directly to the Internet without a proxy) showed a significant performance difference.
Naturally, the initial thoughts were that the proxy servers (provided by an external vendor) were at fault. The support staff were proposing to change proxy vendors.
However, analysis of packet captures taken on the user’s PC during the test runs showed that the degradation was actually caused by the way the browser handled TCP/IP connections to the proxies.
Changing proxies would not have achieved any improvement, unless changed to a different “style” such as transparent proxies.
|7 days||State Government||A newly purchased application was not performing as well as desired.
Packet captures were taken on a user PC while various application functions were exercised. The exact transaction/network/server behaviours for each function were measured and plotted.
The major cause of poor performance was the way the application accessed the Oracle database. Data was requested in very small chunks, resulting in tens of thousands of unnecessary network round-trips.
Spreadsheets listing the exact SQL statements involved with each function (each with the exact breakdown of client/network/server timings) was provided to the vendor. They were able to modify their application code to dramatically improve efficiency, significantly improving performance.
|1 day||State Government||A newly purchased application was still not performing as well as desired.
It was found that 75% of the time taken for each user function was actually inside the user PC (that is, related to the internal PC application)
Spreadsheets and charts plotting the exact network activities involved with each function (with the exact breakdown of client/network/server timings per application transaction) were provided to the vendor. Any improvements would require application code modifications.
The network and server infrastructure were found “not guilty” of any performance problems.
|9 days||Large Service Provider||Credit card transactions flowing via MQ between mainframes at a major bank and a service provider were experiencing frequent delays of up to a second. Although only 1% of transactions were affected, the total delays added up to 12% (7 minutes per hour of no activity).
The provider staff and their customer had spent months trying to identify the source of the problem – with no success.
Packet captures taken with NetScouts at various points in the network path revealed that something on the customer’s internal network was delaying packets by 10ms. Due to a combination of circumstances, including out-of-order packets, this was causing packet drops and large retransmission delays.
Several possible fixes and workarounds were recommended. A change to the mainframe TCP settings removed the gaps, improved performance – but more importantly, gave the bank more confidence in the system moving into the heavy Christmas period.
|3 days||Telco||Sydney users of an internal SAP application (with server in Asia) experienced occasional failed logons. This did not happen to users outside Australia.
The problem had existed for several weeks.
Analysis of packet captures taken at a user PC in Sydney revealed that a WAN accelerator at the Sydney end of the international WAN was improperly responding to HTTP Authentication requests.
Three main alternatives were recommended.
|3 days||Network Provider||User logons to a new application deployed to users were normally 10-15 seconds but were often taking over 2 minutes (and regularly failing). Three different vendors were involved in providing the application, server and associated network components and were all “pointing the finger” at each other.
Packet captures in 3 locations (user PC, vendor hosted server in different city and WAN provider router) revealed that the “trigger” event was occasional packet losses in the customer’s own network (which was easily fixed).
Arguably, the real cause was the low bandwidth WAN link, coupled with “bad” behaviour in the customer’s own Cisco ASA firewall as well as behaviour in the application vendor’s F5 load balancer. None of these behaviours were “bad” when taken individually – but became so when combined.
|3 days||Large Service Provider||The provider operated the outgoing Internet proxy environment for a large bank. The bank wanted to implement “per user” authentication mechanism and installed a Bluecoat “AAA” server coupled to a dedicated AD server.
However, whenever the AAA function was enabled, Internet web-pages would take up to 30 seconds to load.
The analysis of 11 GB of traffic revealed that the AAA-to-AD connection was only handling 2 transactions at once, even though many thousands of requests were arriving from the proxies every second. This resulted in an internal AAA queue length of over 30 seconds.
This hinted at a process/thread limit in either the AAA or AD server – and it was subsequently found that the AAA configuration file “Bluecoat.ini” contained the setting “MaxThreads = 2”. Changing this to “5” put performance back to normal
This saved the unnecessary $250,000 expense that was budgeted for a bigger/faster proxy server farm.
|Past 8 years||Various||The general situation is a “tricky”, intermittent problem that numerous vendors and teams are involved in (and none can find any problem in their own area). The problem drags on for weeks or months until eventually I’m called in to identify the exact problem behaviours within days or hours.|