[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Bacula-devel] Fwd: [Bacula-users] Solaris "Packet size too big" failures


In case this provides clues to a bug...

Begin forwarded message:

> From: Conor Edberg <conor@xxxxxxxxxxxxx>
> Date: December 23, 2008 11:53:03 AM EST
> To: Jason Dixon <jdixon@xxxxxxxxxx>
> Cc: bacula-users@xxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [Bacula-users] Solaris "Packet size too big" failures
>
>
> On Dec 9, 2008, at 2:36 PM, Jason Dixon wrote:
>
>> On Tue, Dec 09, 2008 at 09:26:38PM +0000, Allan Black wrote:
>>> Jason Dixon wrote:
>>>> Alas, I spoke too soon.  The CatalogBackup job failed again last
>>>> night,
>>>> usual symptoms.
>>>
>>> OK. need to find out what the FD is doing. I would recommend:
>>>
>>> truss -o filename -f -a -e -v all -w 2 -p <FD pid>
>>>
>>> Is it possible to run the catalog backup during the day, by hand?
>>> That way you could avoid leaving the truss running (and collecting
>>> data) all night.
>>
>> I've run it 6 times today, no failures yet.  Frustrating.  Here are
>> the
>> results showing all the previous failures, then the successes today.
>>
>> -bash-3.2$ echo 'list jobs' | sudo /opt/bacula/sbin/i386/bconsole |
>> grep BackupCatalog | grep '| f '
>> |    60 | BackupCatalog            | 2008-11-08 00:36:12 | B    | F
>> |          1 |     965,869,568 | f         |
>> |   116 | BackupCatalog            | 2008-11-11 23:14:05 | B    | F
>> |          1 |     614,727,680 | f         |
>> |   286 | BackupCatalog            | 2008-11-22 23:14:28 | B    | F
>> |          1 |   5,278,400,512 | f         |
>> |   298 | BackupCatalog            | 2008-11-23 23:14:32 | B    | F
>> |          1 |   4,723,965,952 | f         |
>> |   336 | BackupCatalog            | 2008-11-26 23:14:30 | B    | F
>> |          1 |   3,979,280,384 | f         |
>> |   361 | BackupCatalog            | 2008-11-28 23:16:11 | B    | F
>> |          1 |   2,101,936,128 | f         |
>> |   385 | BackupCatalog            | 2008-11-30 23:13:21 | B    | F
>> |          1 |   3,863,216,128 | f         |
>>
>> -bash-3.2$ echo 'list jobs' | sudo /opt/bacula/sbin/i386/bconsole |
>> grep BackupCatalog | grep '2008-12-09'
>> |   486 | BackupCatalog            | 2008-12-09 16:46:51 | B    | F
>> |          1 |   3,183,911,677 | T         |
>> |   487 | BackupCatalog            | 2008-12-09 16:55:42 | B    | F
>> |          1 |   3,183,912,220 | T         |
>> |   488 | BackupCatalog            | 2008-12-09 17:03:22 | B    | F
>> |          1 |   3,183,912,724 | T         |
>> |   489 | BackupCatalog            | 2008-12-09 17:18:03 | B    | F
>> |          1 |   3,183,913,233 | T         |
>> |   490 | BackupCatalog            | 2008-12-09 17:23:26 | B    | F
>> |          1 |   3,183,913,740 | T         |
>> |   491 | BackupCatalog            | 2008-12-09 17:29:53 | B    | F
>> |          1 |   3,183,914,246 | T         |
>>
>>
>> --  
>> Jason Dixon
>> OmniTI Computer Consulting, Inc.
>> jdixon@xxxxxxxxxx
>> 443.325.1357 x.241
>>
>
>
>
> I'm seeing similar problems to the one Jason described.  However, in
> addition to the 'Packet Size too big' failure, I see a variety of
> other error messages as well.  FD, SD, and dir are all running on the
> same machine, disk to disk backup. Some examples of the failures:
>
> 20-Dec 05:01 server-sd JobId 328: Fatal error: append.c:190 File index
> from FD not positive or sequential
> 20-Dec 05:01 server-sd JobId 328: Job write elapsed time = 00:18:22,
> Transfer rate = 10.48 M bytes/second
> 20-Dec 05:01 server-fd JobId 328: Fatal error: backup.c:1003 Network
> send error to SD. ERR=Broken pipe
> 20-Dec 05:01 server-fd JobId 328: Error: bsock.c:341 Write error
> sending -1 bytes to Storage daemon:192.168.1.123:9103: ERR=Broken pipe
> 20-Dec 05:01 server-dir JobId 328: Error: Bacula server-dir 2.5.16-
> beta (20Oct08): 20-Dec-2008 05:01:23
>   Build OS:               i386-pc-solaris2.11 solaris 5.11
>
> 05-Dec 13:30 server-fd JobId 275: Fatal error: backup.c:957 Network
> send error to SD. ERR=I/O error
> 05-Dec 13:30 server-sd JobId 275: Fatal error: append.c:259 Network
> error on data channel. ERR=No data available
> 05-Dec 13:30 server-sd JobId 275: Job write elapsed time = 01:36:26,
> Transfer rate = 50.96 M bytes/second
> 05-Dec 13:30 server-sd JobId 275: Error: bsock.c:485 Read error from
> client:192.168.1.123:36643: ERR=No data available
> 05-Dec 13:30 server-dir JobId 275: Bacula server-dir 2.5.16-beta
> (20Oct08): 05-Dec-2008 13:30:43
>   Build OS:               i386-pc-solaris2.11 solaris 5.11
>
> 10-Oct 10:15 server-sd JobId 130: Fatal error: append.c:180 Malformed
> data header from FD:
> 10-Oct 10:15 server-sd JobId 130: Job write elapsed time = 00:03:34,
> Transfer rate = 13.56 M bytes/second\
> 10-Oct 10:15 server-fd JobId 130: Fatal error: backup.c:892 Network
> send error to SD. ERR=Broken pipe\
> 10-Oct 10:15 server-dir JobId 130: Error: Bacula server-dir 2.4.2
> (26Jul08): 10-Oct-2008 10:15:49\
>   Build OS:               i386-pc-solaris2.11 solaris 5.11
>
> 19-Sep 23:05 server-sd JobId 23: Ready to append to end of Volume
> "test1" size=140938726135
> 19-Sep 23:20 server-sd JobId 23: Fatal error: append.c:159 Error
> reading data header from FD. ERR=No data available
> 19-Sep 23:20 server-sd JobId 23: Job write elapsed time = 00:15:04,
> Transfer rate = 8.229 M bytes/second
> 19-Sep 23:20 server-sd JobId 23: Fatal error: bsock.c:415 Packet size
> too big from "client:127.0.0.1:36643. Terminating connection.
> 19-Sep 23:20 server-fd JobId 23: Fatal error: backup.c:892 Network
> send error to SD. ERR=Broken pipe
> 19-Sep 23:20 server-fd JobId 23: Error: bsock.c:306 Write error
> sending 32768 bytes to Storage daemon:127.0.0.1:9103: ERR=Broken pipe
> 19-Sep 23:20 server-dir JobId 23: Error: Bacula server-dir 2.4.2
> (26Jul08): 19-Sep-2008 23:20:18
>   Build OS:               i386-pc-solaris2.11 solaris 5.11
>   JobId:                  23
>   Job:                    Client1.2008-09-19_23.05.49
>
>
>
> Here's the truss of a failure (I think this is the relevant bit),
> using the options from Alan's message above:
>
> 14457/53:       write(6, "\0\080\0 m k Y\r\n M 3 m".., 32772)   =  
> 32772
> 14457/53:       read(14, " a p L S p L 9 f 6 + l 2".., 32768)   =  
> 32768
> 14457/53:       write(6, "\0\080\0 a p L S p L 9 f".., 32772)   =  
> 32772
> 14457/53:       read(14, " I T F a T F B i m 4 S V".., 32768)   =  
> 32768
> 14457/53:       write(6, "\0\080\0 I T F a T F B i".., 32772)   =  
> 32772
> 14457/53:       read(14, " C O / B f / / v\r\n 9 f".., 32768)   =  
> 32768
> 14457/53:       write(6, "\0\080\0 C O / B f / / v".., 32772)   =  
> 32772
> 14457/54:       pollsys(0xFE2DFDF0, 1, 0xFE2DFE98, 0x00000000)  = 1
> 14457/54:               fd=6  ev=POLLRDNORM rev=POLLRDNORM
> 14457/54:               timeout: 5.000000000 sec
> 14457/53:       read(14, " 6 d L r f 3 3\r\n 3 p q".., 32768)   =  
> 32768
> 14457/54:       read(6, 0xFE2DFF7C, 4)
> Err#131 ECONNRESET
> 14457/54:       lwp_sigmask(SIG_SETMASK, 0xFFBFFEFF, 0x0000FFF7) =
> 0xFFBFFEFF [0x0000FFFF]
> 14457/54:       lwp_exit()
> 14457/53:       write(6, "\0\080\0 6 d L r f 3 3\r".., 32772)   Err#32
> EPIPE
> 14457/53:           Received signal #13, SIGPIPE [ignored]
> 14457/53:       time()                                          =
> 1229375022
> 14457/53:       time()                                          =
> 1229375022
> 14457/53:       write(4, "\0\0\094 J m s g   J o b".., 152)     = 152
> 14457/53:       close(14)                                       = 0
> 14457/53:       close(13)                                       = 0
> 14457/53:       close(12)                                       = 0
> 14457/53:       close(11)                                       = 0
> 14457/53:       close(10)                                       = 0
> 14457/53:       close(9)                                        = 0
> 14457/53:       close(8)                                        = 0
> 14457/53:       close(7)                                        = 0
> 14457/53:       nanosleep(0xFE3DEE0C, 0x00000000)               = 0
> 14457/53:               tmout: 0.050000000 sec
> 14457/53:       nanosleep(0xFE3DEE0C, 0x00000000)               = 0
> 14457/53:               tmout: 0.050000000 sec
> 14457/53:       nanosleep(0xFE3DEE0C, 0x00000000)               = 0
> 14457/53:               tmout: 0.050000000 sec
> 14457/53:       nanosleep(0xFE3DEE0C, 0x00000000)               = 0
> 14457/53:               tmout: 0.050000000 sec
>
>
> --
> Conor Edberg
> Letters & Science IT
> UC Santa Barbara
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Bacula-users mailing list
> Bacula-users@xxxxxxxxxxxxxxxxxxxxx
> https://lists.sourceforge.net/lists/listinfo/bacula-users

-- 
Dan Langille
http://langille.org/




------------------------------------------------------------------------------
_______________________________________________
Bacula-devel mailing list
Bacula-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.sourceforge.net/lists/listinfo/bacula-devel


This mailing list archive is a service of Copilotco.