Microsoft Exchange Server Logo

Exchange DAG Replication Problem: An established connection was aborted by the software in your host machine

I had an issue with a four node DAG where the DR site with two of the DAG members were having replication issues. It was only technically affecting one DAG Member though. The copy queue length was really high and the logs were not committing to the database. A Test-ReplicationHealth cmdlet test told that the copy queue length for the affected database copy was high. No other databases were affected as there were eight databases on this DAG Node. The issue was that the log files were not replicating properly to the one DAG member for that database, causing the log file drives on all the other DAG members to build and become full:

EX04 DAG member has high Copy Queue Length
The purple member (EX04) free space is different from the other three DAG members

Circular Logging was turned on, but since the db was NOT in sync, the logs could NOT truncate properly which rendered CL useless. What was being done to stave the issue was to suspend the database copy of the affected DAG member (EX04), then resume the copy. The logs would replay and commit to the database copy on the DAG member, but over a short period of time, the same issue would arise again, as shown in this graph:

You can see the other DAG members start dropping in free space

There were absolutely no errors in the Event Viewer showing this replication issue. After some research, I ran the following cmdlet showing a particular output parameter that gave me the actual problem:

Get-MailboxDatabaseCopyStatus DAG1DB01 | ft -a -wr Name, Status, IncomingLogCopyingNetwork

Output with the actual error listed for the DR DAG members.

The operative error here was: {An error occurred while communicating with server ‘EX01’. Error: Unable to read data from the transport connection: An established connection was aborted by the software in your host machine.} 

Now even though only EX04 was actually having problems with its log replication, both DR members EX03 & EX04 were having the same problem. Again, there were NO events in event viewer showing this issue. I next did some connectivity tests to EX01 from EX04 even though the error said there was an established connection that was broken.

Ping EX01 -f -l 1472

Now the -f states do NOT fragment the packet and send it as a whole to the destination.
The -l states the packet/buffer size you want sent. In this case 1472 bits.
By doing this, you are able to assure that a router or switch is NOT segmenting the packets, packet segmentation of replication logs can cause data corruption and replication issues.

That test passed successfully. I also did a trace route to assure there was no packet loss on the route to the replicating server. That test passed successfully.

I next checked the DAG Network to assure that all networks were working for replication. Now, in this scenario, there was only ONE DAG Network, there was NOT a separate Replication Network. I did not design the DAG and limitations most likely came into play during the design. From my experience, you setup a separate replication network for replication only, but if your network has enough bandwidth, and the design calls for simplification, you can use one DAG network in your design.

Get-DatabaseAvailabilityGroupNetwork | fl 

RunspaceId : a1600003-8074-4000-9150-c7800000207f 
Name : MapiDagNetwork 
Description : 
Subnets : {{192.168.1.0/24,Up}, {192.168.2.0/24,Up}} 
Interfaces : {{EX01,Up,192.168.1.25}, {EX02,Up,192.168.1.26},{EX03,Up,192.168.2.25}, {EX04,Up,192.168.2.26}} 
MapiAccessEnabled : True 
ReplicationEnabled : True 
IgnoreNetwork : False 
Identity : DAG1\MapiDagNetwork 
IsValid : True 
ObjectState : New 

All the DAG Network Members were up and not showing errors. I next did a telnet session to EX01 over the default DAG replication port 64327 to see if there would be any connectivity issues to EX01:

telnet EX01 64327

That test was successful and there were no connectivity issues to EX01 from EX04. Again, there was only ONE database out of eight that was having replication problems. After mulling over the problem, it was decided to restart the MSExchangeRepl service on EX03 AND EX04 since the error was present on both DAG members. We would then, suspend the database copy and resume the database copy on the affected servers.

Run on EX03:
Restart-Service MSExchangeRepl
Suspend-MailboxDatabaseCopy DAG1DB01/EX03 -Confirm:$False
Resume-MailboxDatabaseCopy DAG1DB01/EX03 -Confirm:$False

Run on EX04:
Restart-Service MSExchangeRepl
Suspend-MailboxDatabaseCopy DAG1DB01/EX04 -Confirm:$False
Resume-MailboxDatabaseCopy DAG1DB01/EX04 -Confirm:$False

After monitoring the databases and log drives, the issue was resolved and replication started functioning properly.

Log Drive Available Space Returned to Normal for DAG members

PLEASE COMMENT! I WELCOME SUGGESTIONS, TIPS, ALTERNATIVE TROUBLESHOOTING! HAVE A GREAT DAY!

Get E-Mail Updates
I agree to have my personal information transfered to MailChimp ( more information )
Want to know when I post new content? Sign up here to get an email when I do post!
I despise spam. Your email address will not be sold or shared with anyone else.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.