11.1_5运行问题 - Agzs/geth-pbft-study GitHub Wiki
1、pbftMessage无法广播出去
当commChan中产生新消息时,protocolManager会调用BroadcastMsg(),该函数如下:
func (pm *ProtocolManager) BroadcastMsg(msg *types.PbftMessage) {
/// TODO: May need to optimize it, only broadcast msg to peers without it. --Zhiguo
/// peers := pm.peers.PeersWithoutTx(hash)
//FIXME include this again: peers = peers[:int(math.Sqrt(float64(len(peers))))]
//=> add PeerWithoutMsg() start. --Agzs
hash := types.Hash(msg)
peers := pm.peers.PeersWithoutMsg(hash)
for _, peer := range peers {
log.Info("peer broadcast msg", "peer", peer.id, "send msg's hash:", hash) //=>test. --Agzs
peer.SendMsg(msg)
}
log.Trace("Broadcast transaction", "hash", hash, "recipients", len(pm.peers.peers)) //=> peers -> pm.peers.peers --Agzs
}
该函数会对加入到p2p的每个peer调用SendMsg(),最终会调用go-ethereum/p2p中Send(p.rw, ConsensusMsg, msg),通过打印数据发现
此时,并没有真正发送出去,此过程还涉及到rlp编码;Send()函数最终会在ethereum/go-ethereum/rlp/typecache.go中genTypeInfo()函数内调用makeDecoder()和makeWriter().
func makeDecoder(typ reflect.Type, tags tags) (dec decoder, err error) {
kind := typ.Kind()
log.Info(" makeDecoder --start--", "typeName", kind) //=>test. --Agzs
switch {
case typ == rawValueType:
return decodeRawValue, nil
case typ.Implements(decoderInterface):
return decodeDecoder, nil
case kind != reflect.Ptr && reflect.PtrTo(typ).Implements(decoderInterface):
return decodeDecoderNoPtr, nil
case typ.AssignableTo(reflect.PtrTo(bigInt)):
return decodeBigInt, nil
case typ.AssignableTo(bigInt):
return decodeBigIntNoPtr, nil
case isUint(kind):
return decodeUint, nil
case kind == reflect.Bool:
return decodeBool, nil
case kind == reflect.String:
return decodeString, nil
case kind == reflect.Slice || kind == reflect.Array:
return makeListDecoder(typ, tags)
case kind == reflect.Struct:
return makeStructDecoder(typ)
case kind == reflect.Ptr:
if tags.nilOK {
return makeOptionalPtrDecoder(typ)
}
return makePtrDecoder(typ)
case kind == reflect.Interface:
return decodeInterface, nil
// case kind == reflect.Map: //=>TODO. --Agzs
// return nil, nil
default:
log.Info(" makeDecoder --default--", "typeName", kind) //=>test. --Agzs
return nil, fmt.Errorf("rlp: type %v is not RLP-serializable", typ)
}
}
- 原因:
1)msg.code和protoclLength问题,学姐已解决,实际上只需要对eth63版本的长度+1就行,eth62版本长度仍然是8。后期添加新的编号时,eth63的长度需要加相应的个数。
2)由于types.PbftMessage所包含的成员变量存在switch分支中未定义的类型,所以将匹配default分支,最终会返回错误。
- 解决方法:
重新定义了types.PbftMessage的结构体,好像不用修该这部分结构体也可以.
在NewView中的XSet成员变量为map[uint64]string类型,而makeDecoder()的switch分支中并没有reflect.Map,如果新增加该分支的话,需要自己编写相关的decoder函数,涉及元数据,容易出错,所以现在临时将Xset由map[uint64]string定义为[]*XSet类型,涉及xset的都需要改变,由于修改内容比较多,在此不详细介绍,后期会在push汇总中介绍。
修改本部分后,其他的signer可收到types.PbftMessage类型,msg.code也为17,但是出现解码错误问题(一)
2、解码错误问题(一)
//位于ethereum/go-ethereum/rlp/typecache.go
func (s *Stream) Decode(val interface{}) error {
if val == nil {
return errDecodeIntoNil
}
rval := reflect.ValueOf(val)
rtyp := rval.Type()
if rtyp.Kind() != reflect.Ptr {
return errNoPointer
}
if rval.IsNil() {
return errDecodeIntoNil
}
info, err := cachedTypeInfo(rtyp.Elem(), tags{})
if err != nil {
return err
}
err = info.decoder(s, rval.Elem()) //=> consensusMsg will call decodeInterface(s *Stream, val reflect.Value) --Agzs
if decErr, ok := err.(*decodeError); ok && len(decErr.ctx) > 0 {
// add decode target type to error so context has more meaning
log.Info("stream.Decode() decErr") //=>test. --Agzs
decErr.ctx = append(decErr.ctx, fmt.Sprint("(", rtyp.Elem(), ")"))
}
return err
}
未重新定义types.PbftMessage前,该部分不报错,其中decoder会调用decodeInterface(s *Stream, val reflect.Value)
重新定义修改后,会执行log.Info("stream.Decode() decErr"),出现decErr错误。
- 问题原因如下:
在新修改的types.PbftMessage结构体中,将payload换成了preprepare、prepare...等指针,如果要发送的消息是相关类型的消息,就赋值给该成员变量,其他成员变量为nil,(比如要发送preprep,则types.PbftMessage的成员变量PrePrepare指针指向preprep,其他成员变量指针都设置为空)。当时只是为了解决消息发送问题,没想到在这出错了。由于新修改的结构体types.PbftMessage中在使用时含有空指针数据,所以会报此错误。新修改的结构体类型如下:
type pbftMessage struct {
PrePrepare *PrePrepare //=> payload = 1
Prepare *Prepare //=> payload = 2
Commit *Commit //=> payload = 3
Checkpoint *Checkpoint //=> payload = 4
ViewChange *ViewChange //=> payload = 5
NewView *NewView //=> payload = 6
//FetchBlockMsg *FetchBlockMsg //=> payload = 7
Sender uint64 //=>add. --Agzs
PayloadCode uint64
Payload interface{}
}
- 解决方法:
重新定义结构体,去除指针定义,采用原先的payload,定义为接口类型:
type pbftMessage struct {
Sender uint64 //=>add. --Agzs
PayloadCode uint64
Payload interface{}
}
问题解决,主节点可以发送消息,其他VP能收到消息,却出现解码错误问题(二)
3、解码错误问题(二) payload
主节点可以发送消息,其他VP能收到消息,可以获取到除payload以外的所有数据,唯独无法识别payload的真实类型(比如PrePrepare),报错如下:
panic: interface conversion: interface {} is []interface {}, not *types.PrePrepare
- 原因:
types.PbftMessage在p2p传播的时候被转换成inteface{}类型,而payload作为其成员变量也是接口类型,由于interface的类型转换比较混乱,参考go:interface{}、断言与类型转换,RLP编码传参的类型是interface,初步猜想RLP可以编码,但是解码的时候,无法读取interface类型的数据。
ethereum中的block定义也有interface{}成员变量:ReceivedFrom interface{},但是该变量并未通过编解码,而是通过request.Block.ReceivedFrom = p,其中p是handleMsg(p *peer)中的参数。
- 解决方法:
将types.PbftMessage中的payload在p2p发送之前读取出来,使用各自的msg.code,分类发送;接收端,分类回收,最后组装成types.PbftMessage,然后进行事件处理,具体修改push汇总
4、sendViewChange报错
- 提示错误:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xc93636]
goroutine 55 [running]:
github.com/yeongchingtarn/geth-pbft/consensus/pbft.(*pbftCore).sendViewChange(0xc4203d2000, 0x10aafe8, 0x3d)
/home/zhiguo/go/src/github.com/yeongchingtarn/geth-pbft/consensus/pbft/viewchange.go:185 +0x736
github.com/yeongchingtarn/geth-pbft/consensus/pbft.(*pbftCore).ProcessEvent(0xc4203d2000, 0xf02820, 0x1a13f30, 0xc4216b9a60, 0x1)
/home/zhiguo/go/src/github.com/yeongchingtarn/geth-pbft/consensus/pbft/pbft-core.go:373 +0xf37
- 原因:其他signer未启动miner.start(),导致signer和signFn没有初始化赋值:
func (s *Ethereum) StartMining(local bool) error {
eb, err := s.Etherbase()
...
//=>add PBFT StratMining. start --Agzs
if pbft, ok := s.engine.(*pbft.PBFT); ok {
wallet, err := s.accountManager.Find(accounts.Account{Address: eb})
if wallet == nil || err != nil {
log.Error("Etherbase account unavailable locally", "err", err)
return fmt.Errorf("singer missing: %v", err)
}
pbft.Authorize(eb, wallet.SignHash)
}
//=>add PBFT StratMining. end --Agzs
...
go s.miner.Start(eb)
return nil
}
- 解决方法:
其他signer节点运行miner.start()
5、其他节点处理消息后,只向主节点回复消息,没有向p2p其他的peer广播
主节点可以向其他所有peer广播,signer1:
INFO [11-04|23:01:43] peer broadcast msg peer=4a3de4861cca1453 send msg's hash:=cf1a9b…818e4d
INFO [11-04|23:01:43] peer.SendMsg() start pbftMessageType=*types.PrePrepare
INFO [11-04|23:01:43] p2p sends message!!! before writeMsg() msgsize=650 msgcode=17
INFO [11-04|23:01:43] WriteMsg() msg.Code=17 offset=16
INFO [11-04|23:01:43] peer broadcast msg peer=032d933d9b02b13a send msg's hash:=cf1a9b…818e4d
INFO [11-04|23:01:43] peer.SendMsg() start pbftMessageType=*types.PrePrepare
INFO [11-04|23:01:43] p2p sends message!!! before writeMsg() msgsize=650 msgcode=17
INFO [11-04|23:01:43] WriteMsg() msg.Code=17 offset=16
INFO [11-04|23:01:43] peer broadcast msg peer=906b7ebc5c0f0f5e send msg's hash:=cf1a9b…818e4d
INFO [11-04|23:01:43] peer.SendMsg() start pbftMessageType=*types.PrePrepare
INFO [11-04|23:01:43] p2p sends message!!! before writeMsg() msgsize=650 msgcode=17
INFO [11-04|23:01:43] WriteMsg() msg.Code=17 offset=16
INFO [11-04|23:01:43] pm.BroadcastMsg() end------------
而其他signer只能向主节点广播,比如signer2节点:
INFO [11-04|23:01:43] handleMsg() ---PrePrepareMsg-----------
INFO [11-04|23:01:43] send preprepare_message to pm.pbftmanager.Queue()
2017/11/04 23:01:43 Replica 1 processing event
2017/11/04 23:01:43 Replica 1 received incoming message from 0
INFO [11-04|23:01:43] recvMsg() test sendID=0 code=1 payload=*types.PrePrepare
2017/11/04 23:01:43 Replica 1 processing event
2017/11/04 23:01:43 Replica 1 received pre-prepare from replica 0 for view=0/seqNo=9
INFO [11-04|23:01:43] Replica storing block in outstanding block store Replica(PeerID)=1 hash=738efc…270394
2017/11/04 23:01:43 Backup 1 broadcasting prepare for view=0/seqNo=9
2017/11/04 23:01:43 Replica 1 received prepare from replica 1 for view=0/seqNo=9
2017/11/04 23:01:43 Replica 1 prepare count for view=0/seqNo=9: 1
2017/11/04 23:01:43 Replica 1 prepare count for view=0/seqNo=9: 1
2017/11/04 23:01:43 send msg to commChan!
INFO [11-04|23:01:43] pm.BroadcastMsg() start------------
INFO [11-04|23:01:43] peer broadcast msg peer=e29caed3b331495a send msg's hash:=a912a3…0f22d3
INFO [11-04|23:01:43] peer.SendMsg() start pbftMessageType=*types.Prepare
INFO [11-04|23:01:43] p2p sends message!!! before writeMsg() msgsize=40 msgcode=18
INFO [11-04|23:01:43] WriteMsg() msg.Code=18 offset=16
INFO [11-04|23:01:43] pm.BroadcastMsg() end------------
主节点(signer1)可以收到其他三个signer的prepare,然后发出commit,其他signer能收到主节点的commit,但是signer互相之间却收不到各自的prepare,导致其他signer无法发送自己的commit,会进一步触发viewchange。
- 问题原因:
其他signer都只是和signer1建立了联系,但是互相之间没有建立联系.
- 解决方法:
使用admin.addPeer("")进行添加,比如有四个signer,分别为signer1、signer2、signer3、signer4,signer1为主节点:
需要在signer2、signer3、signer4终端都运行admin.addPeer("encode://...signer1_encode...")
需要在signer3、signer4终端都运行admin.addPeer("encode://...signer2_encode...")
需要在signer4终端都运行admin.addPeer("encode://...signer3_encode...")
修改上述错误后,节点间都能收到commit,但是在recvCommit()函数内存在错误。。。
6、recvCommit()问题
在pbft-core.go中,recvCommit()函数中,只能打印出recvCommit(1) --executeOutstanding()
instance.executeOutstanding() ////xiaobei
log.Info("recvCommit(1) --executeOutstanding()") //=>test. --Agzs
instance.helper.manager.Queue() <- execDoneEvent{} ////xiaobei
log.Info("recvCommit(2) --send execDoneEvent") //=>test. --Agzs
instance.finishedChan <- struct{}{} /// inform PBFT consensus is reached. --Zhiguo
log.Info("recvCommit(3) --finished") //=>test. --Agzs
- 原因
instance.helper.manager.Queue() <- execDoneEvent{}语句向Queue()中发送事件,该函数返回值为channel类型,一次只能存储一个事件。运行过程中可能存在拥挤或抢占,有的事件可能无法挤进该channel,或者发送到channel中,刚开始要处理,又被其他事件抢占了。(注:以上只是猜想,由于goroutine的存在,情况不好判定)
- 解决方法
将execDoneEvent{}触发的操作,移到此处,在processEvent()中的注释掉:
instance.executeOutstanding() ////xiaobei
log.Info("recvCommit(1) --executeOutstanding()") //=>test. --Agzs
//============================
//instance.helper.manager.Queue() <- execDoneEvent{} ////xiaobei
instance.execDoneSync()
log.Info("execDoneSync() end") //=>test. --Agzs
if instance.skipInProgress {
instance.retryStateTransfer(nil)
log.Info("retryStateTransfer() end") //=>test. --Agzs
}
//=========================
log.Info("recvCommit(2) --send execDoneEvent") //=>test. --Agzs
instance.finishedChan <- struct{}{} /// inform PBFT consensus is reached. --Zhiguo
log.Info("recvCommit(3) --finished") //=>test. --Agzs
问题解决,并且signer1成功挖出区块1:
INFO [11-05|20:53:51] recvCommit(1) --executeOutstanding()
INFO [11-05|20:53:51] execDoneSync()
2017/11/05 20:53:51 Replica 0 had execDoneSync called, flagging ourselves as out of date
2017/11/05 20:53:51 Replica 0 attempting to executeOutstanding
2017/11/05 20:53:51 Replica 0 certstore map[{v:0 n:3}:0xc436d02720]
INFO [11-05|20:53:51] execDoneSync() end
2017/11/05 20:53:51 Replica 0 has no targets to attempt state transfer to, delaying
INFO [11-05|20:53:51] retryStateTransfer() end
INFO [11-05|20:53:51] recvCommit(2) --send execDoneEvent
INFO [11-05|20:53:51] recvCommit(3) --finished
INFO [11-05|20:53:51] Successfully sealed new block number=1 hash=a7c286…300942
...
INFO [11-05|20:53:51] 🔨 mined potential block number=1 hash=a7c286…300942
...
INFO [11-05|20:53:51] Commit new mining work number=2 txs=0 uncles=0 elapsed=437.671µs
但是,其他signer却提示“Discarded bad propagated block”,见问题9。
7、rocksDB错误(强制退出造成的问题)
panic: Error opening DB: IO error: lock /home/zhiguo/hyperledgerRocksDB/signer1/db/LOCK: Resource temporarily unavailable
- 原因
终端运行时,没有正常退出,导致rocksDB关闭的”善后“工作没有进行。
强制退出:
> exit
INFO [11-05|11:30:03] IPC endpoint closed: /home/zhiguo/pbft/signer2/data/geth.ipc
INFO [11-05|11:30:03] Blockchain manager stopped
INFO [11-05|11:30:03] Stopping Ethereum protocol
INFO [11-05|11:30:03] Read the next message from the remote peer in pm.handleMsg()
INFO [11-05|11:30:03] Read the next message from the remote peer in pm.handleMsg() returns error
INFO [11-05|11:30:03] Read the next message from the remote peer in pm.handleMsg()
INFO [11-05|11:30:03] Read the next message from the remote peer in pm.handleMsg() returns error
^C
INFO [11-05|11:30:41] Got interrupt, shutting down...
^Z
[1]+ Stopped geth --datadir ./data --networkid 55661 --pbftid 1 --port 2003 --unlock c28e9d79dfa4a291cbf7422f4ba82fa59e710bc4 console
正常退出:
> exit
INFO [11-05|11:21:42] IPC endpoint closed: /home/zhiguo/pbft/signer2/data/geth.ipc
INFO [11-05|11:21:42] Blockchain manager stopped
INFO [11-05|11:21:42] Stopping Ethereum protocol
INFO [11-05|11:21:42] Ethereum protocol stopped
INFO [11-05|11:21:42] Transaction pool stopped
INFO [11-05|11:21:42] Database closed database=/home/zhiguo/pbft/signer2/data/geth/chaindata
2017/11/05 11:21:42 RocksDB closed!
比较发现,IPC和Blockchain manager均正常关闭,Ethereum protocol、Transaction pool、Database均未正常关闭。
- 解决方法
直接杀死进程:
在终端运行ps指令:
ethtest@ethtest:~/pbft$ ps
PID TTY TIME CMD
21553 pts/23 00:00:00 bash
23747 pts/23 00:00:01 geth
23908 pts/23 00:00:00 ps
强制杀死进程23747:
ethtest@ethtest:~/pbft$ sudo kill -9 23747
[1]+ Killed geth --datadir pbft/signer1/data --networkid 55661 --port 2000 (wd: ~)
(wd now: ~/pbft)
以下是之前的解决方法,比较笨,可直接跳过到下一个问题。
目前没有有效的解决方法,只能找到相关db文件夹,删除之。
删除之后,rocksDB没有问题了,levelDB又出现问题了
INFO [11-05|16:29:09] Starting peer-to-peer node instance=Geth/v1.6.7-stable/linux-amd64/go1.8.3
2017/11/05 16:29:09 Is db path [/home/zhiguo/hyperledgerRocksDB/signer3/db] empty [false]
2017/11/05 16:29:09 Setting rocksdb maxLogFileSize to 10485760
2017/11/05 16:29:09 Setting rocksdb keepLogFileNum to 10
2017/11/05 16:29:09 Setting rocks db InfoLogLevel to 2
INFO [11-05|16:29:09] Allocated cache and file handles database=/home/zhiguo/pbft/signer2/data/geth/chaindata cache=128 handles=1024
Fatal: Error starting protocol stack: resource temporarily unavailable
继续将相关文件删除, 又提示:
INFO [11-05|16:34:50] init protocol manager and commChan in NewProtocolManager()! replica=3
INFO [11-05|16:34:50] Starting P2P networking
Fatal: Error starting protocol stack: listen udp :2005: bind: address already in use
最终,只能将signer目录下的所有文件删除,重新创建账户,初始化的创世区块。。。。。
此外, pbft.json需要进行修改,(当然也可以删除后重新生成),如果进行修改的话,需要修改两部分:
1)授权节点部分
"extraData": "0x0000000000000000000000000000000000000000000000000000000000000000
557afdbcc81c0199305ebbc61580eb0b17be7fdb
575b73df85e22751c143e839ac90c04796463b2f
c28e9d79dfa4a291cbf7422f4ba82fa59e710bc4
73ab48254894b1b645e37bbeec2c55d6c5bd6b04 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000",
这是四个授权的账户,需要换成新创建的账户。(为了观察方便,我对四个账户添加了换行,原文件中四个账户的地址都是连载一起的)
2)预设账户余额部分
"575b73df85e22751c143e839ac90c04796463b2f": {
"balance": "0x200000000000000000000000000000000000000000000000000000000000000"
},
"c4006961eda5b91c38fd93ee6021679f25b97410": {
"balance": "0x200000000000000000000000000000000000000000000000000000000000000"
}
其中575为signer1的账户地址,c40为node1地址.
8、无法正常退出
> exit
INFO [11-05|17:39:23] IPC endpoint closed: /home/zhiguo/pbft/signer1/data/geth.ipc
INFO [11-05|17:39:23] Blockchain manager stopped
INFO [11-05|17:39:23] Stopping Ethereum protocol
INFO [11-05|17:39:23] Read the next message from the remote peer in pm.handleMsg()
INFO [11-05|17:39:23] Read the next message from the remote peer in pm.handleMsg() returns error
INFO [11-05|17:39:23] Read the next message from the remote peer in pm.handleMsg()
INFO [11-05|17:39:23] Read the next message from the remote peer in pm.handleMsg() returns error
WARN [11-05|17:45:09] System clock seems off by -12.297202182s, which can prevent network connectivity
WARN [11-05|17:45:09] Please enable network time synchronisation in system settings.
WARN [11-05|18:05:52] System clock seems off by -12.302529327s, which can prevent network connectivity
WARN [11-05|18:05:52] Please enable network time synchronisation in system settings.
WARN [11-05|18:17:04] System clock seems off by -12.301364649s, which can prevent network connectivity
WARN [11-05|18:17:04] Please enable network time synchronisation in system settings.
WARN [11-05|18:39:11] System clock seems off by -12.291644646s, which can prevent network connectivity
WARN [11-05|18:39:11] Please enable network time synchronisation in system settings.
WARN [11-05|18:54:23] System clock seems off by -12.275869174s, which can prevent network connectivity
WARN [11-05|18:54:23] Please enable network time synchronisation in system settings.
WARN [11-05|19:05:01] System clock seems off by -12.28059504s, which can prevent network connectivity
WARN [11-05|19:05:01] Please enable network time synchronisation in system settings.
一直停顿在这。。。
- 原因未知,目前暂时未重新遇到,一般停顿20s左右就正常退出了。
- 解决方法
法1:连续按10次 ctrl + C,报错退出。虽然是笨方法,但是重新运行geth console不会遇到错误(7)
法2:强制退出,类似问题7的解决方法
9、同步区块模式问题
signer1成功挖出区块1:
INFO [11-05|20:53:51] recvCommit(1) --executeOutstanding()
INFO [11-05|20:53:51] execDoneSync()
2017/11/05 20:53:51 Replica 0 had execDoneSync called, flagging ourselves as out of date
2017/11/05 20:53:51 Replica 0 attempting to executeOutstanding
2017/11/05 20:53:51 Replica 0 certstore map[{v:0 n:3}:0xc436d02720]
INFO [11-05|20:53:51] execDoneSync() end
2017/11/05 20:53:51 Replica 0 has no targets to attempt state transfer to, delaying
INFO [11-05|20:53:51] retryStateTransfer() end
INFO [11-05|20:53:51] recvCommit(2) --send execDoneEvent
INFO [11-05|20:53:51] recvCommit(3) --finished
INFO [11-05|20:53:51] Successfully sealed new block number=1 hash=a7c286…300942
...
INFO [11-05|20:53:51] 🔨 mined potential block number=1 hash=a7c286…300942
...
INFO [11-05|20:53:51] Commit new mining work number=2 txs=0 uncles=0 elapsed=437.671µs
但是,其他signer却提示:
INFO [11-05|20:53:51] recvCommit(1) --executeOutstanding()
INFO [11-05|20:53:51] execDoneSync()
2017/11/05 20:53:51 Replica 1 had execDoneSync called, flagging ourselves as out of date
2017/11/05 20:53:51 Replica 1 attempting to executeOutstanding
2017/11/05 20:53:51 Replica 1 certstore map[{v:0 n:3}:0xc4368ab860]
INFO [11-05|20:53:51] execDoneSync() end
2017/11/05 20:53:51 Stopping timer
INFO [11-05|20:53:51] send commit_message to pm.pbftmanager.Queue()
2017/11/05 20:53:51 Replica 1 has no targets to attempt state transfer to, delaying
INFO [11-05|20:53:51] retryStateTransfer() end
INFO [11-05|20:53:51] recvCommit(2) --send execDoneEvent
INFO [11-05|20:53:51] Read the next message from the remote peer in pm.handleMsg()
INFO [11-05|20:53:51] handleMsg() ---NewBlockMsg-----------
INFO [11-05|20:53:51] Read the next message from the remote peer in pm.handleMsg()
INFO [11-05|20:53:51] handleMsg() ---NewBlockHashesMsg-----------
WARN [11-05|20:53:51] Discarded bad propagated block number=1 hash=a7c286…300942
INFO [11-05|20:53:52] Read the next message from the remote peer in pm.handleMsg()
INFO [11-05|20:53:52] handleMsg() ---NewBlockMsg-----------
WARN [11-05|20:53:52] Discarded bad propagated block number=1 hash=a7c286…300942
- 原因
func NewProtocolManager(config *params.ChainConfig, mode downloader.SyncMode, networkId uint64, maxPeers int, mux *event.TypeMux, txpool txPool, engine consensus.Engine, blockchain *core.BlockChain, chaindb ethdb.Database) (*ProtocolManager, error) {
// Create the protocol manager with the base fields
manager := &ProtocolManager{
...
}
...
// Figure out whether to allow fast sync or not
if mode == downloader.FastSync && blockchain.CurrentBlock().NumberU64() > 0 {
log.Warn("Blockchain not empty, fast sync disabled")
mode = downloader.FullSync
}
if mode == downloader.FastSync {
manager.fastSync = uint32(1)
}
...
inserter := func(blocks types.Blocks) (int, error) {
// If fast sync is running, deny importing weird blocks
if atomic.LoadUint32(&manager.fastSync) == 1 {
log.Warn("Discarded bad propagated block", "number", blocks[0].Number(), "hash", blocks[0].Hash())
return 0, nil
}
atomic.StoreUint32(&manager.acceptTxs, 1) // Mark initial sync done on any fetcher import
return manager.blockchain.InsertChain(blocks)
}
manager.fetcher = fetcher.New(blockchain.GetBlockByHash, validator, manager.BroadcastBlock, heighter, inserter, manager.removePeer)
return manager, nil
}
manager.fastSync被设置为1了,这是ethereum的默认配置。经查找代码发现,只有执行inserter函数才会提示Discarded bad propagated block,而inserter函数被赋值给Fetcher.insertChain,最终在insert()中被调用:
func (f *Fetcher) insert(peer string, block *types.Block) {
hash := block.Hash()
// Run the import on a new thread
log.Debug("Importing propagated block", "peer", peer, "number", block.Number(), "hash", hash)
go func() {
...
if _, err := f.insertChain(types.Blocks{block}); err != nil {
log.Debug("Propagated block import failed", "peer", peer, "number", block.Number(), "hash", hash, "err", err)
return
}
// If import succeeded, broadcast the block
propAnnounceOutTimer.UpdateSince(block.ReceivedAt)
go f.broadcastBlock(block, false)
// Invoke the testing hook if needed
if f.importedHook != nil {
f.importedHook(block)
}
}()
}
测试clique的时候,遇到同样的问题,具体原因未知,但是在clique中任意两个点运行admin.addPeer(),两两互联后,该问题消失。
后继续查找原因,发现pbft中使用了clique中的snap.Recents,该变量用于标记最近签名的signer,如果该signer近期签过名,后期便不能再次签名,会返回errUnauthorized,主要有以下三个地方存在这个问题(添加注释部分):
func (c *PBFT) verifySeal(chain consensus.ChainReader, header *types.Header, parents []*types.Header) error {
...
// for seen, recent := range snap.Recents {
// if recent == signer {
// // Signer is among recents, only fail if the current block doesn't shift it out
// if limit := uint64(len(snap.Signers)/2 + 1); seen > number-limit {
// return errUnauthorized
// }
// }
// }
// Ensure that the difficulty corresponds to the turn-ness of the signer
inturn := snap.inturn(header.Number.Uint64(), signer)
if inturn && header.Difficulty.Cmp(diffInTurn) != 0 {
return errInvalidDifficulty
}
if !inturn && header.Difficulty.Cmp(diffNoTurn) != 0 {
return errInvalidDifficulty
}
return nil
}
func (c *PBFT) Seal(chain consensus.ChainReader, block *types.Block, stop <-chan struct{}) (*types.Block, error) {
...
//=> --Agzs
// // If we're amongst the recent signers, wait for the next block
// for seen, recent := range snap.Recents {
// if recent == signer {
// // Signer is among recents, only wait if the current block doesn't shift it out
// if limit := uint64(len(snap.Signers)/2 + 1); number < limit || seen > number-limit {
// log.Info("Signed recently, must wait for others")
// <-stop
// return nil, nil
// }
// }
// }
...
// Sign all the things!
sighash, err := signFn(accounts.Account{Address: signer}, sigHash(header, nil).Bytes()) //=> TODO. --Agzs
...
newBlock := block.WithSeal(header)
c.pbft.recvRequestBlock(newBlock)
/// TODO: wait until the commit messages are received by 2/3 PBFT nodes, then return the block --Zhiguo
select {
case <-stop:
log.Info("Stop PBFT algorithm!")
return nil, nil
case <-c.finishedChan:
//c.pbft.lastExec = newBlock.Header().Number.Uint64() //=>TODO. lastExec. --Agzs
return newBlock, nil
}
/// return nil, nil
}
func (s *Snapshot) apply(headers []*types.Header) (*Snapshot, error) {
...
// for _, recent := range snap.Recents {
// if recent == signer {
// return nil, errUnauthorized
// }
// }
...
}
由于返回了errUnauthorized,其后语句无法正常进行,所以会造成该错误。
- 解决方法
注释掉上述三个地方的snap.Recents中返回errUnauthorized的for循环,如上所示。
解决掉上述问题后,signer1可以连续挖块,其他signer也可以import这些区块,包括加入通过addPeer()添加的node也可以import。
注意:
建议运行前删除存储rocksDB的db文件夹,因为运行测试的过程中必定存在错误,而一些错误的preprepare等其他消息将会保存到pset和qset,然后进一步保存到rocksDB,重新启动后,这些数据都会被重新读取,导致一些seqNo出现错误,程序无法正常运行,一定要注意这一点!!!!!!
10、权限问题
在问题9没解决的基础上,猜想问题9解决后,该问题应该会解决。
关闭各终端,然后重新运行geth,又遇到新问题:
signer1报错:
INFO [11-06|10:57:37] Successfully sealed new block number=2 hash=205067…39e5d8
INFO [11-06|10:57:37] 🔨 mined potential block number=2 hash=205067…39e5d8
ERROR[11-06|10:57:37] Failed to prepare header for mining err=unauthorized
signer2报错,其他signer类似:
INFO [11-06|10:57:37] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=220.297µs mgasps=0.000 number=2 hash=205067…39e5d8
ERROR[11-06|10:57:37] Failed to prepare header for mining err=unauthorized
如果正常的话,signer2应该import的blocks的数目为2,并且signer1不会出现未授权情况。
测试clique正常运行情况:
signer1运行结果:
> miner.start()
INFO [11-06|11:10:10] Transaction pool price threshold updated price=18000000000
INFO [11-06|11:10:10] Starting mining operation
null
> INFO [11-06|11:10:10] Commit new mining work number=22 txs=0 uncles=0 elapsed=113.333µs
INFO [11-06|11:10:10] Successfully sealed new block number=22 hash=af7e9b…d34f0c
INFO [11-06|11:10:10] 🔨 mined potential block number=22 hash=af7e9b…d34f0c
INFO [11-06|11:10:10] Commit new mining work number=23 txs=0 uncles=0 elapsed=408.973µs
> admin.addPeer("enode://996058eac555decf228b31699553d28fb1f65fb116f6d98b18b9f42d4d65498a4f9db808ace34211e3b7fad627f028f6c7fdd39b745aa3cdca2ca367ac51b362@127.0.0.1:2000")
INFO [11-06|11:10:20] Successfully sealed new block number=23 hash=0c3286…2bd8eb
INFO [11-06|11:10:20] 🔨 mined potential block number=23 hash=0c3286…2bd8eb
INFO [11-06|11:10:20] Commit new mining work number=24 txs=0 uncles=0 elapsed=391.673µs
INFO [11-06|11:10:30] Successfully sealed new block number=24 hash=383202…0d2002
INFO [11-06|11:10:30] 🔨 mined potential block number=24 hash=383202…0d2002
INFO [11-06|11:10:30] Commit new mining work number=25 txs=0 uncles=0 elapsed=379.078µs
true
node1运行结果:
INFO [11-06|11:10:38] Imported new chain segment blocks=4 txs=0 mgas=0.000 elapsed=1.844ms mgasps=0.000 number=24 hash=383202…0d2002 ignored=4
INFO [11-06|11:10:40] Read the next message from the remote peer in pm.handleMsg()
INFO [11-06|11:10:40] handleMsg() ---NewBlockMsg-----------
INFO [11-06|11:10:40] Read the next message from the remote peer in pm.handleMsg()
INFO [11-06|11:10:40] handleMsg() ---NewBlockHashesMsg-----------
INFO [11-06|11:10:40] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=292.206µs mgasps=0.000 number=25 hash=c0e8f4…e9449d
- 解决方法 问题9解决后,该问题不存在。。。。。。。