11.1_5 Runtime Issues - Agzs/geth-pbft-study GitHub Wiki

1. pbftMessage cannot be broadcast

When a new message appears on commChan, protocolManager calls BroadcastMsg(), shown below:

func (pm *ProtocolManager) BroadcastMsg(msg *types.PbftMessage) {
	
	/// TODO: May need to optimize it, only broadcast msg to peers without it. --Zhiguo
	///	peers := pm.peers.PeersWithoutTx(hash)
	//FIXME include this again: peers = peers[:int(math.Sqrt(float64(len(peers))))]
	
	//=> add PeerWithoutMsg() start. --Agzs
	hash := types.Hash(msg)
	peers := pm.peers.PeersWithoutMsg(hash)

	for _, peer := range peers {
		log.Info("peer broadcast msg", "peer", peer.id, "send msg's hash:", hash) //=>test. --Agzs
		peer.SendMsg(msg)
	}

	log.Trace("Broadcast transaction", "hash", hash, "recipients", len(pm.peers.peers)) //=> peers ->  pm.peers.peers --Agzs
}

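This function calls SendMsg() for every peer that has joined the p2p network, which ultimately calls Send(p.rw, ConsensusMsg, msg) in go-ethereum/p2p. Debug output showed that at this point the message was not actually being sent. The send path also involves RLP encoding: Send() eventually reaches genTypeInfo() in ethereum/go-ethereum/rlp/typecache.go, which calls makeDecoder() and makeWriter().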

func makeDecoder(typ reflect.Type, tags tags) (dec decoder, err error) {
	kind := typ.Kind()
	log.Info(" makeDecoder --start--", "typeName", kind) //=>test. --Agzs
	switch {
	case typ == rawValueType:
		return decodeRawValue, nil
	case typ.Implements(decoderInterface):
		return decodeDecoder, nil
	case kind != reflect.Ptr && reflect.PtrTo(typ).Implements(decoderInterface):
		return decodeDecoderNoPtr, nil
	case typ.AssignableTo(reflect.PtrTo(bigInt)):
		return decodeBigInt, nil
	case typ.AssignableTo(bigInt):
		return decodeBigIntNoPtr, nil
	case isUint(kind):
		return decodeUint, nil
	case kind == reflect.Bool:
		return decodeBool, nil
	case kind == reflect.String:
		return decodeString, nil
	case kind == reflect.Slice || kind == reflect.Array:
		return makeListDecoder(typ, tags)
	case kind == reflect.Struct:
		return makeStructDecoder(typ)
	case kind == reflect.Ptr:
		if tags.nilOK {
			return makeOptionalPtrDecoder(typ)
		}
		return makePtrDecoder(typ)
	case kind == reflect.Interface:
		return decodeInterface, nil
	// case kind == reflect.Map: //=>TODO. --Agzs
	// 	return nil, nil
	default:
		log.Info(" makeDecoder --default--", "typeName", kind) //=>test. --Agzs
		return nil, fmt.Errorf("rlp: type %v is not RLP-serializable", typ)
	}
}

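Cause:

1) A msg.code and protocolLength problem, which a senior labmate has already fixed. In practice only the eth63 length needs to be increased by 1; the eth62 length stays at 8. Whenever new message codes are added later, the eth63 length must be increased by the same number.

2) Some fields of types.PbftMessage have types that no branch of the switch handles, so they fall through to the default branch, which returns an error.

Solution:

The types.PbftMessage struct was redefined, although it seems this part of the struct may not actually need to change.

The XSet field of NewView has type map[uint64]string, but the switch in makeDecoder() has no reflect.Map branch. Adding one would mean writing the corresponding decoder function ourselves, which involves type metadata and is error-prone. As a temporary workaround, XSet was changed from map[uint64]string to []*XSet. Everything that touches XSet has to change accordingly; because the changes are extensive they are not detailed here and will be covered later in the push summary.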
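The map-to-slice workaround can be illustrated with a minimal sketch. The XSetEntry type and its SeqNo and Digest field names are assumptions made for this example; the actual []*XSet element type in the repository is not shown on this page. Entries are sorted by sequence number so that the resulting list is deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// XSetEntry is a hypothetical list element that stands in for one
// key/value pair of the original map[uint64]string.
type XSetEntry struct {
	SeqNo  uint64
	Digest string
}

// mapToEntries flattens the map into a slice sorted by SeqNo, a shape
// that list-based encoders can handle.
func mapToEntries(m map[uint64]string) []*XSetEntry {
	entries := make([]*XSetEntry, 0, len(m))
	for seqNo, digest := range m {
		entries = append(entries, &XSetEntry{SeqNo: seqNo, Digest: digest})
	}
	sort.Slice(entries, func(i, j int) bool { return entries[i].SeqNo < entries[j].SeqNo })
	return entries
}

// entriesToMap rebuilds the map on the receiving side.
func entriesToMap(entries []*XSetEntry) map[uint64]string {
	m := make(map[uint64]string, len(entries))
	for _, e := range entries {
		m[e.SeqNo] = e.Digest
	}
	return m
}

func main() {
	xset := map[uint64]string{9: "digest-9", 7: "digest-7"}
	entries := mapToEntries(xset)
	for _, e := range entries {
		fmt.Println(e.SeqNo, e.Digest)
	}
	fmt.Println("round trip size:", len(entriesToMap(entries)))
}
```

Keeping the conversion in two small helpers means the rest of the PBFT code could keep working with a map and only convert at the network boundary.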

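After this change, the other signers can receive the types.PbftMessage and msg.code is 17 as expected, but a decoding error appears; see Decoding error (I).

2. Decoding error (I)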

// Located in ethereum/go-ethereum/rlp/decode.go
func (s *Stream) Decode(val interface{}) error {
	if val == nil {
		return errDecodeIntoNil
	}
	rval := reflect.ValueOf(val)
	rtyp := rval.Type()
	if rtyp.Kind() != reflect.Ptr {
		return errNoPointer
	}
	if rval.IsNil() {
		return errDecodeIntoNil
	}

	info, err := cachedTypeInfo(rtyp.Elem(), tags{})
	if err != nil {
		return err
	}

	err = info.decoder(s, rval.Elem()) //=> consensusMsg will call decodeInterface(s *Stream, val reflect.Value) --Agzs
	if decErr, ok := err.(*decodeError); ok && len(decErr.ctx) > 0 {
		// add decode target type to error so context has more meaning
		log.Info("stream.Decode() decErr") //=>test. --Agzs
		decErr.ctx = append(decErr.ctx, fmt.Sprint("(", rtyp.Elem(), ")"))
	}
	return err
}

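Before types.PbftMessage was redefined, this code raised no error; the decoder called decodeInterface(s *Stream, val reflect.Value).

After the redefinition, log.Info("stream.Decode() decErr") is executed, meaning a decErr occurred.

The cause is as follows:

In the newly modified types.PbftMessage struct, payload was replaced by pointers such as PrePrepare, Prepare, and so on. When sending a message of a given type, it is assigned to the matching field and the other fields are left nil (for example, to send preprep, the PrePrepare pointer of types.PbftMessage points to preprep and all other pointer fields are nil). This was done only to get message sending working, without expecting it to fail here. Because the modified types.PbftMessage carries nil pointers when it is used, decoding reports this error. The modified struct is as follows: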

type pbftMessage struct {
	PrePrepare *PrePrepare //=> payload = 1
	Prepare    *Prepare    //=> payload = 2
	Commit     *Commit     //=> payload = 3
	Checkpoint *Checkpoint //=> payload = 4
	ViewChange *ViewChange //=> payload = 5
	NewView    *NewView    //=> payload = 6
	//FetchBlockMsg *FetchBlockMsg //=> payload = 7
	Sender      uint64 //=>add. --Agzs
	PayloadCode uint64
	Payload     interface{}
}

Solution:

Redefine the struct, drop the pointer fields, and go back to the original payload, defined as an interface type:

type pbftMessage struct {
	Sender      uint64 //=>add. --Agzs
	PayloadCode uint64
	Payload     interface{}
}

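This solved the problem: the primary node can send messages and the other VPs can receive them, but Decoding error (II) then appears.

3. Decoding error (II): payload

The primary node can send messages, and the other VPs receive them and can read every field except payload. Only the concrete type of payload (for example PrePrepare) cannot be recovered, and the following error is reported: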

panic: interface conversion: interface {} is []interface {}, not *types.PrePrepare

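Cause:

types.PbftMessage is converted to interface{} when it is propagated over p2p, and its payload field is itself an interface type. Conversions between interface types are confusing (see "Go: interface{}, assertions and type conversion"). The RLP functions take an interface{} argument; the initial guess is that RLP can encode the value, but when decoding it cannot recover data held in an interface-typed field.

The block definition in ethereum also has an interface{} field, ReceivedFrom interface{}, but that field is never encoded or decoded. It is set with request.Block.ReceivedFrom = p, where p is the parameter of handleMsg(p *peer).

Solution:

Extract the payload from types.PbftMessage before sending over p2p and send each payload type under its own msg.code. The receiver handles each code separately, reassembles a types.PbftMessage, and then passes it on for event processing. The detailed changes are in the push summary.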
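The per-code dispatch described above can be sketched as follows. This is an illustration only: encoding/json stands in for RLP so the example runs without go-ethereum, the payload structs are simplified with hypothetical fields, and Message is a stand-in for types.PbftMessage. The codes 17 and 18 are the msgcode values that appear in the logs on this page. The key point is that the receiver chooses the concrete Go type from the message code before decoding, instead of decoding into a bare interface{}.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Wire codes; 17 and 18 match the msgcode values logged on this page.
const (
	PrePrepareMsg uint64 = 17
	PrepareMsg    uint64 = 18
)

// Hypothetical, simplified payload types.
type PrePrepare struct {
	View  uint64
	SeqNo uint64
}

type Prepare struct {
	View  uint64
	SeqNo uint64
}

// Message is a simplified stand-in for types.PbftMessage.
type Message struct {
	Sender  uint64
	Payload interface{}
}

// encodePayload picks the wire code from the payload's concrete type and
// serializes only the payload (JSON here, RLP in the real code).
func encodePayload(payload interface{}) (uint64, []byte, error) {
	var code uint64
	switch payload.(type) {
	case *PrePrepare:
		code = PrePrepareMsg
	case *Prepare:
		code = PrepareMsg
	default:
		return 0, nil, fmt.Errorf("unsupported payload type %T", payload)
	}
	data, err := json.Marshal(payload)
	return code, data, err
}

// decodePayload allocates the concrete type selected by the code, decodes
// into it, and reassembles the message.
func decodePayload(sender, code uint64, data []byte) (*Message, error) {
	var payload interface{}
	switch code {
	case PrePrepareMsg:
		payload = new(PrePrepare)
	case PrepareMsg:
		payload = new(Prepare)
	default:
		return nil, fmt.Errorf("unknown message code %d", code)
	}
	if err := json.Unmarshal(data, payload); err != nil {
		return nil, err
	}
	return &Message{Sender: sender, Payload: payload}, nil
}

func main() {
	code, data, err := encodePayload(&PrePrepare{View: 0, SeqNo: 9})
	if err != nil {
		panic(err)
	}
	msg, err := decodePayload(0, code, data)
	if err != nil {
		panic(err)
	}
	// The type assertion now succeeds because the decoder was given
	// a *PrePrepare to fill in.
	pp, ok := msg.Payload.(*PrePrepare)
	fmt.Println(ok, pp.SeqNo)
}
```

Because the target type is fixed before decoding, the receiver never has to assert a generic decoded value to *types.PrePrepare, which is the step that panicked above.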

4. sendViewChange panics

Error message:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xc93636]

goroutine 55 [running]:
github.com/yeongchingtarn/geth-pbft/consensus/pbft.(*pbftCore).sendViewChange(0xc4203d2000, 0x10aafe8, 0x3d)
	/home/zhiguo/go/src/github.com/yeongchingtarn/geth-pbft/consensus/pbft/viewchange.go:185 +0x736
github.com/yeongchingtarn/geth-pbft/consensus/pbft.(*pbftCore).ProcessEvent(0xc4203d2000, 0xf02820, 0x1a13f30, 0xc4216b9a60, 0x1)
	/home/zhiguo/go/src/github.com/yeongchingtarn/geth-pbft/consensus/pbft/pbft-core.go:373 +0xf37

Cause: the other signers had not run miner.start(), so signer and signFn were never initialized:

func (s *Ethereum) StartMining(local bool) error {
	eb, err := s.Etherbase()
	...
	//=>add PBFT StartMining. start --Agzs
	if pbft, ok := s.engine.(*pbft.PBFT); ok {
		wallet, err := s.accountManager.Find(accounts.Account{Address: eb})
		if wallet == nil || err != nil {
			log.Error("Etherbase account unavailable locally", "err", err)
			return fmt.Errorf("singer missing: %v", err)
		}
		pbft.Authorize(eb, wallet.SignHash)
	}
	//=>add PBFT StartMining. end --Agzs

	...
	go s.miner.Start(eb)
	return nil
}

Solution:

Run miner.start() on the other signer nodes.
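A defensive guard would turn this kind of crash into an ordinary error. The sketch below is an assumption-laden illustration: the engine type, SignerFn signature, and errNotAuthorized are hypothetical names, and this page does not show what line 185 of viewchange.go actually does. It only demonstrates that checking the callback for nil before calling it avoids a nil pointer dereference when Authorize has not run yet.

```go
package main

import (
	"errors"
	"fmt"
)

// SignerFn is a hypothetical signing callback, set by Authorize.
type SignerFn func(hash []byte) ([]byte, error)

// errNotAuthorized is a hypothetical error for the "miner not started" case.
var errNotAuthorized = errors.New("signer not authorized: run miner.start() first")

// engine is a hypothetical stand-in for the consensus engine.
type engine struct {
	signFn SignerFn
}

// Authorize injects the signing callback, as StartMining does for PBFT.
func (e *engine) Authorize(fn SignerFn) { e.signFn = fn }

// sign returns an error instead of panicking when no callback is set.
func (e *engine) sign(hash []byte) ([]byte, error) {
	if e.signFn == nil {
		return nil, errNotAuthorized
	}
	return e.signFn(hash)
}

func main() {
	e := &engine{}
	_, err := e.sign([]byte("view-change"))
	fmt.Println("before Authorize:", err)

	e.Authorize(func(hash []byte) ([]byte, error) {
		return append([]byte("sig:"), hash...), nil
	})
	sig, err := e.sign([]byte("view-change"))
	fmt.Println("after Authorize:", string(sig), err)
}
```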

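5. After processing a message, the other nodes reply only to the primary node instead of broadcasting to the other p2p peers

The primary node can broadcast to all other peers. signer1: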

INFO [11-04|23:01:43] peer broadcast msg                       peer=4a3de4861cca1453 send msg's hash:=cf1a9b…818e4d
INFO [11-04|23:01:43] peer.SendMsg() start                     pbftMessageType=*types.PrePrepare
INFO [11-04|23:01:43] p2p sends message!!! before writeMsg()   msgsize=650 msgcode=17
INFO [11-04|23:01:43] WriteMsg()                               msg.Code=17 offset=16

INFO [11-04|23:01:43] peer broadcast msg                       peer=032d933d9b02b13a send msg's hash:=cf1a9b…818e4d
INFO [11-04|23:01:43] peer.SendMsg() start                     pbftMessageType=*types.PrePrepare
INFO [11-04|23:01:43] p2p sends message!!! before writeMsg()   msgsize=650 msgcode=17
INFO [11-04|23:01:43] WriteMsg()                               msg.Code=17 offset=16

INFO [11-04|23:01:43] peer broadcast msg                       peer=906b7ebc5c0f0f5e send msg's hash:=cf1a9b…818e4d
INFO [11-04|23:01:43] peer.SendMsg() start                     pbftMessageType=*types.PrePrepare
INFO [11-04|23:01:43] p2p sends message!!! before writeMsg()   msgsize=650 msgcode=17
INFO [11-04|23:01:43] WriteMsg()                               msg.Code=17 offset=16
INFO [11-04|23:01:43] pm.BroadcastMsg() end------------

The other signers, however, can only broadcast to the primary node. For example, signer2:

INFO [11-04|23:01:43] handleMsg() ---PrePrepareMsg----------- 
INFO [11-04|23:01:43] send preprepare_message to pm.pbftmanager.Queue() 
2017/11/04 23:01:43 Replica 1 processing event
2017/11/04 23:01:43 Replica 1 received incoming message from 0
INFO [11-04|23:01:43] recvMsg() test                           sendID=0 code=1 payload=*types.PrePrepare
2017/11/04 23:01:43 Replica 1 processing event
2017/11/04 23:01:43 Replica 1 received pre-prepare from replica 0 for view=0/seqNo=9
INFO [11-04|23:01:43] Replica storing block in outstanding block store Replica(PeerID)=1 hash=738efc…270394

2017/11/04 23:01:43 Backup 1 broadcasting prepare for view=0/seqNo=9
2017/11/04 23:01:43 Replica 1 received prepare from replica 1 for view=0/seqNo=9
2017/11/04 23:01:43 Replica 1 prepare count for view=0/seqNo=9: 1
2017/11/04 23:01:43 Replica 1 prepare count for view=0/seqNo=9: 1
2017/11/04 23:01:43 send msg to commChan!

INFO [11-04|23:01:43] pm.BroadcastMsg() start------------ 
INFO [11-04|23:01:43] peer broadcast msg                       peer=e29caed3b331495a send msg's hash:=a912a3…0f22d3
INFO [11-04|23:01:43] peer.SendMsg() start                     pbftMessageType=*types.Prepare
INFO [11-04|23:01:43] p2p sends message!!! before writeMsg()   msgsize=40  msgcode=18
INFO [11-04|23:01:43] WriteMsg()                               msg.Code=18 offset=16
INFO [11-04|23:01:43] pm.BroadcastMsg() end------------

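The primary node (signer1) receives the prepare messages from the other three signers and then sends its commit. The other signers receive the primary's commit, but they never receive each other's prepare messages, so they cannot send their own commit, which in turn triggers a view change.

Cause:

The other signers had each connected only to signer1 and not to one another.

Solution:

Add the missing peers with admin.addPeer(""). For example, with four signers signer1, signer2, signer3, and signer4, where signer1 is the primary node:

Run admin.addPeer("enode://...signer1_enode...") in the consoles of signer2, signer3, and signer4.

Run admin.addPeer("enode://...signer2_enode...") in the consoles of signer3 and signer4.

Run admin.addPeer("enode://...signer3_enode...") in the console of signer4.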
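The pattern above generalizes to any number of signers: each node adds every node that precedes it in the list, which yields n(n-1)/2 connections and a full mesh. A small sketch that prints the required calls (the link type and fullMesh helper are names invented for this example, and the enode URLs are left as placeholders):

```go
package main

import "fmt"

// link records that admin.addPeer must be run on From with To's enode URL.
type link struct {
	From string
	To   string
}

// fullMesh returns one link per unordered pair of nodes, so that every
// node ends up connected to every other node.
func fullMesh(nodes []string) []link {
	var links []link
	for i := 0; i < len(nodes); i++ {
		for j := i + 1; j < len(nodes); j++ {
			links = append(links, link{From: nodes[j], To: nodes[i]})
		}
	}
	return links
}

func main() {
	signers := []string{"signer1", "signer2", "signer3", "signer4"}
	for _, l := range fullMesh(signers) {
		fmt.Printf("on %s: admin.addPeer(<enode URL of %s>)\n", l.From, l.To)
	}
}
```

For four signers this produces six calls, matching the three steps listed above.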

After fixing the above, all nodes receive commit messages, but an error remains inside recvCommit().

6. recvCommit() problem

In recvCommit() in pbft-core.go, only recvCommit(1) --executeOutstanding() is printed:

		instance.executeOutstanding()                      ////xiaobei
		log.Info("recvCommit(1) --executeOutstanding()")   //=>test. --Agzs
		instance.helper.manager.Queue() <- execDoneEvent{} ////xiaobei
		log.Info("recvCommit(2) --send execDoneEvent")     //=>test. --Agzs
		instance.finishedChan <- struct{}{}                /// inform PBFT consensus is reached.  --Zhiguo
		log.Info("recvCommit(3) --finished")               //=>test. --Agzs

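Cause

The statement instance.helper.manager.Queue() <- execDoneEvent{} sends an event to Queue(), which returns a channel that can hold only one event at a time. At runtime there may be contention or preemption: some events may fail to get into the channel, or an event may be sent to the channel and then be preempted by another event just as processing starts. (Note: this is only a guess; with goroutines involved, the situation is hard to pin down.)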
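One way such a send can stall is sketched below, under two assumptions that this page does not confirm: that Queue() returns an unbuffered channel, and that recvCommit() runs on the same goroutine that normally receives from it. In that situation nobody is available to receive, so the send blocks and the log lines after it never appear. The trySend helper is a name invented for this example; it uses a timeout only so that the demonstration terminates.

```go
package main

import (
	"fmt"
	"time"
)

// trySend reports whether an event could be sent on ch within the given
// number of milliseconds.
func trySend(ch chan struct{}, ms int) bool {
	select {
	case ch <- struct{}{}:
		return true
	case <-time.After(time.Duration(ms) * time.Millisecond):
		return false
	}
}

func main() {
	queue := make(chan struct{}) // unbuffered: an assumption about Queue()

	// No other goroutine is receiving, so the send cannot complete.
	fmt.Println("sent without receiver:", trySend(queue, 50))

	// With a separate receiver goroutine the same send goes through.
	go func() { <-queue }()
	fmt.Println("sent with receiver:", trySend(queue, 1000))
}
```

Calling the handler logic directly, as the fix below does, sidesteps the channel entirely and so cannot block in this way.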

Solution

Move the work triggered by execDoneEvent{} to this point and comment out the corresponding code in processEvent():

		instance.executeOutstanding()                    ////xiaobei
		log.Info("recvCommit(1) --executeOutstanding()") //=>test. --Agzs
		//============================
		//instance.helper.manager.Queue() <- execDoneEvent{} ////xiaobei
		instance.execDoneSync()
		log.Info("execDoneSync() end") //=>test. --Agzs
		if instance.skipInProgress {
			instance.retryStateTransfer(nil)
			log.Info("retryStateTransfer() end") //=>test. --Agzs
		}
		//=========================
		log.Info("recvCommit(2) --send execDoneEvent") //=>test. --Agzs
		instance.finishedChan <- struct{}{}            /// inform PBFT consensus is reached.  --Zhiguo
		log.Info("recvCommit(3) --finished")           //=>test. --Agzs

Problem solved, and signer1 successfully mined block 1:

INFO [11-05|20:53:51] recvCommit(1) --executeOutstanding() 
INFO [11-05|20:53:51] execDoneSync() 
2017/11/05 20:53:51 Replica 0 had execDoneSync called, flagging ourselves as out of date
2017/11/05 20:53:51 Replica 0 attempting to executeOutstanding
2017/11/05 20:53:51 Replica 0 certstore map[{v:0 n:3}:0xc436d02720]
INFO [11-05|20:53:51] execDoneSync() end 
2017/11/05 20:53:51 Replica 0 has no targets to attempt state transfer to, delaying
INFO [11-05|20:53:51] retryStateTransfer() end 
INFO [11-05|20:53:51] recvCommit(2) --send execDoneEvent 
INFO [11-05|20:53:51] recvCommit(3) --finished 
INFO [11-05|20:53:51] Successfully sealed new block            number=1 hash=a7c286…300942

...
INFO [11-05|20:53:51] 🔨 mined potential block                  number=1 hash=a7c286…300942

...
INFO [11-05|20:53:51] Commit new mining work                   number=2 txs=0 uncles=0 elapsed=437.671µs

However, the other signers report "Discarded bad propagated block"; see problem 9.

7. RocksDB error (caused by a forced exit)

panic: Error opening DB: IO error: lock /home/zhiguo/hyperledgerRocksDB/signer1/db/LOCK: Resource temporarily unavailable

Cause

The terminal session did not exit cleanly, so RocksDB's shutdown cleanup never ran.

Forced exit:

> exit
INFO [11-05|11:30:03] IPC endpoint closed: /home/zhiguo/pbft/signer2/data/geth.ipc 
INFO [11-05|11:30:03] Blockchain manager stopped 
INFO [11-05|11:30:03] Stopping Ethereum protocol 
INFO [11-05|11:30:03] Read the next message from the remote peer in pm.handleMsg() 
INFO [11-05|11:30:03] Read the next message from the remote peer in pm.handleMsg() returns error 
INFO [11-05|11:30:03] Read the next message from the remote peer in pm.handleMsg() 
INFO [11-05|11:30:03] Read the next message from the remote peer in pm.handleMsg() returns error 
^C
INFO [11-05|11:30:41] Got interrupt, shutting down... 
^Z
[1]+  Stopped                 geth --datadir ./data --networkid 55661 --pbftid 1 --port 2003 --unlock c28e9d79dfa4a291cbf7422f4ba82fa59e710bc4 console

Normal exit:

> exit
INFO [11-05|11:21:42] IPC endpoint closed: /home/zhiguo/pbft/signer2/data/geth.ipc 
INFO [11-05|11:21:42] Blockchain manager stopped 
INFO [11-05|11:21:42] Stopping Ethereum protocol 
INFO [11-05|11:21:42] Ethereum protocol stopped 
INFO [11-05|11:21:42] Transaction pool stopped 
INFO [11-05|11:21:42] Database closed                          database=/home/zhiguo/pbft/signer2/data/geth/chaindata
2017/11/05 11:21:42 RocksDB closed!

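Comparing the two, IPC and Blockchain manager shut down normally in both cases, while Ethereum protocol, Transaction pool, and Database did not shut down in the forced exit. Note that ^Z only suspends geth; a suspended process still holds its file locks, which is consistent with the LOCK error above.

Solution

Kill the process directly:

Run ps in the terminal: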

ethtest@ethtest:~/pbft$ ps
  PID TTY          TIME CMD
21553 pts/23   00:00:00 bash
23747 pts/23   00:00:01 geth
23908 pts/23   00:00:00 ps

Force-kill process 23747:

ethtest@ethtest:~/pbft$ sudo kill -9 23747
[1]+  Killed                  geth --datadir pbft/signer1/data --networkid 55661 --port 2000  (wd: ~)
(wd now: ~/pbft)

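The following is the earlier workaround, which is clumsy; feel free to skip to the next problem.

~~There is currently no effective fix; the only option is to find the relevant db folder and delete it.~~

After deleting it, RocksDB no longer fails, but LevelDB then reports a problem: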

INFO [11-05|16:29:09] Starting peer-to-peer node               instance=Geth/v1.6.7-stable/linux-amd64/go1.8.3
2017/11/05 16:29:09 Is db path [/home/zhiguo/hyperledgerRocksDB/signer3/db] empty [false]
2017/11/05 16:29:09 Setting rocksdb maxLogFileSize to 10485760
2017/11/05 16:29:09 Setting rocksdb keepLogFileNum to 10
2017/11/05 16:29:09 Setting rocks db InfoLogLevel to 2
INFO [11-05|16:29:09] Allocated cache and file handles         database=/home/zhiguo/pbft/signer2/data/geth/chaindata cache=128 handles=1024
Fatal: Error starting protocol stack: resource temporarily unavailable

After deleting the related files as well, the following appears:

INFO [11-05|16:34:50] init protocol manager and commChan in NewProtocolManager()! replica=3
INFO [11-05|16:34:50] Starting P2P networking 
Fatal: Error starting protocol stack: listen udp :2005: bind: address already in use

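In the end, the only option was to delete everything under the signer directory, recreate the accounts, and reinitialize the genesis block.

In addition, pbft.json must be modified (or deleted and regenerated). If you modify it, two parts need to change:

1) The authorized signers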

"extraData": "0x0000000000000000000000000000000000000000000000000000000000000000
557afdbcc81c0199305ebbc61580eb0b17be7fdb
575b73df85e22751c143e839ac90c04796463b2f
c28e9d79dfa4a291cbf7422f4ba82fa59e710bc4
73ab48254894b1b645e37bbeec2c55d6c5bd6b04 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000",

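These are the four authorized accounts, which must be replaced with the newly created ones. (Line breaks were added between the four accounts for readability; in the original file the four addresses are concatenated.)

2) The pre-funded account balances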

 "575b73df85e22751c143e839ac90c04796463b2f": {
      "balance": "0x200000000000000000000000000000000000000000000000000000000000000"
    },
 "c4006961eda5b91c38fd93ee6021679f25b97410": {
      "balance": "0x200000000000000000000000000000000000000000000000000000000000000"
    }

Here 575 is the account address of signer1 and c40 is the address of node1.
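Rebuilding the extraData string by hand is error-prone, so a small helper can assemble it. The sketch below assumes the clique layout that the sample above appears to follow: 32 zero bytes of vanity, then the 20-byte signer addresses concatenated in the given order, then 65 zero bytes reserved for the seal. Whether this PBFT fork requires exactly that layout is an assumption, and buildExtraData is a name invented for this example.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// buildExtraData concatenates a 32-byte zero vanity prefix, the signer
// addresses, and a 65-byte zero seal placeholder (assumed clique layout).
func buildExtraData(signers []string) (string, error) {
	extra := "0x" + strings.Repeat("00", 32)
	for _, addr := range signers {
		addr = strings.TrimPrefix(strings.ToLower(addr), "0x")
		if _, err := hex.DecodeString(addr); err != nil || len(addr) != 40 {
			return "", fmt.Errorf("address %q is not 20 bytes of hex", addr)
		}
		extra += addr
	}
	return extra + strings.Repeat("00", 65), nil
}

func main() {
	// The four signer addresses from the sample above; replace them with
	// the newly created accounts.
	extra, err := buildExtraData([]string{
		"557afdbcc81c0199305ebbc61580eb0b17be7fdb",
		"575b73df85e22751c143e839ac90c04796463b2f",
		"c28e9d79dfa4a291cbf7422f4ba82fa59e710bc4",
		"73ab48254894b1b645e37bbeec2c55d6c5bd6b04",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(extra)
}
```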

8. geth does not exit cleanly

> exit
INFO [11-05|17:39:23] IPC endpoint closed: /home/zhiguo/pbft/signer1/data/geth.ipc 
INFO [11-05|17:39:23] Blockchain manager stopped 
INFO [11-05|17:39:23] Stopping Ethereum protocol 
INFO [11-05|17:39:23] Read the next message from the remote peer in pm.handleMsg() 
INFO [11-05|17:39:23] Read the next message from the remote peer in pm.handleMsg() returns error 
INFO [11-05|17:39:23] Read the next message from the remote peer in pm.handleMsg() 
INFO [11-05|17:39:23] Read the next message from the remote peer in pm.handleMsg() returns error 
WARN [11-05|17:45:09] System clock seems off by -12.297202182s, which can prevent network connectivity 
WARN [11-05|17:45:09] Please enable network time synchronisation in system settings. 
WARN [11-05|18:05:52] System clock seems off by -12.302529327s, which can prevent network connectivity 
WARN [11-05|18:05:52] Please enable network time synchronisation in system settings. 
WARN [11-05|18:17:04] System clock seems off by -12.301364649s, which can prevent network connectivity 
WARN [11-05|18:17:04] Please enable network time synchronisation in system settings. 
WARN [11-05|18:39:11] System clock seems off by -12.291644646s, which can prevent network connectivity 
WARN [11-05|18:39:11] Please enable network time synchronisation in system settings. 
WARN [11-05|18:54:23] System clock seems off by -12.275869174s, which can prevent network connectivity 
WARN [11-05|18:54:23] Please enable network time synchronisation in system settings. 
WARN [11-05|19:05:01] System clock seems off by -12.28059504s, which can prevent network connectivity 
WARN [11-05|19:05:01] Please enable network time synchronisation in system settings.

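It stays stuck here.

The cause is unknown and the problem has not recurred so far; normally geth exits after pausing for about 20 s.

Solution

Option 1: press Ctrl+C ten times in a row to abort with an error. It is a crude approach, but restarting geth console afterwards does not hit error (7).

Option 2: force the process to exit, as in the solution to problem 7.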

9. Block synchronization mode problem

signer1 successfully mined block 1:

INFO [11-05|20:53:51] recvCommit(1) --executeOutstanding() 
INFO [11-05|20:53:51] execDoneSync() 
2017/11/05 20:53:51 Replica 0 had execDoneSync called, flagging ourselves as out of date
2017/11/05 20:53:51 Replica 0 attempting to executeOutstanding
2017/11/05 20:53:51 Replica 0 certstore map[{v:0 n:3}:0xc436d02720]
INFO [11-05|20:53:51] execDoneSync() end 
2017/11/05 20:53:51 Replica 0 has no targets to attempt state transfer to, delaying
INFO [11-05|20:53:51] retryStateTransfer() end 
INFO [11-05|20:53:51] recvCommit(2) --send execDoneEvent 
INFO [11-05|20:53:51] recvCommit(3) --finished 
INFO [11-05|20:53:51] Successfully sealed new block            number=1 hash=a7c286…300942

...
INFO [11-05|20:53:51] 🔨 mined potential block                  number=1 hash=a7c286…300942

...
INFO [11-05|20:53:51] Commit new mining work                   number=2 txs=0 uncles=0 elapsed=437.671µs

However, the other signers report:

INFO [11-05|20:53:51] recvCommit(1) --executeOutstanding() 
INFO [11-05|20:53:51] execDoneSync() 
2017/11/05 20:53:51 Replica 1 had execDoneSync called, flagging ourselves as out of date
2017/11/05 20:53:51 Replica 1 attempting to executeOutstanding
2017/11/05 20:53:51 Replica 1 certstore map[{v:0 n:3}:0xc4368ab860]
INFO [11-05|20:53:51] execDoneSync() end 
2017/11/05 20:53:51 Stopping timer
INFO [11-05|20:53:51] send commit_message to pm.pbftmanager.Queue() 
2017/11/05 20:53:51 Replica 1 has no targets to attempt state transfer to, delaying
INFO [11-05|20:53:51] retryStateTransfer() end 
INFO [11-05|20:53:51] recvCommit(2) --send execDoneEvent 
INFO [11-05|20:53:51] Read the next message from the remote peer in pm.handleMsg() 
INFO [11-05|20:53:51] handleMsg() ---NewBlockMsg----------- 
INFO [11-05|20:53:51] Read the next message from the remote peer in pm.handleMsg() 
INFO [11-05|20:53:51] handleMsg() ---NewBlockHashesMsg----------- 
WARN [11-05|20:53:51] Discarded bad propagated block           number=1 hash=a7c286…300942
INFO [11-05|20:53:52] Read the next message from the remote peer in pm.handleMsg() 
INFO [11-05|20:53:52] handleMsg() ---NewBlockMsg----------- 
WARN [11-05|20:53:52] Discarded bad propagated block           number=1 hash=a7c286…300942

Cause

func NewProtocolManager(config *params.ChainConfig, mode downloader.SyncMode, networkId uint64, maxPeers int, mux *event.TypeMux, txpool txPool, engine consensus.Engine, blockchain *core.BlockChain, chaindb ethdb.Database) (*ProtocolManager, error) {
	// Create the protocol manager with the base fields
	manager := &ProtocolManager{
		...
	}

	...
	// Figure out whether to allow fast sync or not
	if mode == downloader.FastSync && blockchain.CurrentBlock().NumberU64() > 0 {
		log.Warn("Blockchain not empty, fast sync disabled")
		mode = downloader.FullSync
	}
	if mode == downloader.FastSync {
		manager.fastSync = uint32(1)
	}

	...
	inserter := func(blocks types.Blocks) (int, error) {
		// If fast sync is running, deny importing weird blocks
		if atomic.LoadUint32(&manager.fastSync) == 1 {
			log.Warn("Discarded bad propagated block", "number", blocks[0].Number(), "hash", blocks[0].Hash())
			return 0, nil
		}
		atomic.StoreUint32(&manager.acceptTxs, 1) // Mark initial sync done on any fetcher import
		return manager.blockchain.InsertChain(blocks)
	}
	manager.fetcher = fetcher.New(blockchain.GetBlockByHash, validator, manager.BroadcastBlock, heighter, inserter, manager.removePeer)

	return manager, nil
}

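manager.fastSync is set to 1, which is ethereum's default configuration. Searching the code shows that "Discarded bad propagated block" is only logged by the inserter function, which is assigned to Fetcher.insertChain and ultimately called in insert():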

func (f *Fetcher) insert(peer string, block *types.Block) {
	hash := block.Hash()

	// Run the import on a new thread
	log.Debug("Importing propagated block", "peer", peer, "number", block.Number(), "hash", hash)
	go func() {
		...
		if _, err := f.insertChain(types.Blocks{block}); err != nil {
			log.Debug("Propagated block import failed", "peer", peer, "number", block.Number(), "hash", hash, "err", err)
			return
		}
		// If import succeeded, broadcast the block
		propAnnounceOutTimer.UpdateSince(block.ReceivedAt)
		go f.broadcastBlock(block, false)

		// Invoke the testing hook if needed
		if f.importedHook != nil {
			f.importedHook(block)
		}
	}()
}

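The same problem appeared when testing clique. The exact cause is unknown, but in clique it disappears once every pair of nodes has been connected with admin.addPeer().

Further investigation showed that pbft uses snap.Recents from clique. This variable records the signers that signed recently: a signer that has signed recently may not sign again for a while, and errUnauthorized is returned. The check appears in three places (the commented-out parts below):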

func (c *PBFT) verifySeal(chain consensus.ChainReader, header *types.Header, parents []*types.Header) error {
	...
	// for seen, recent := range snap.Recents {
	// 	if recent == signer {
	// 		// Signer is among recents, only fail if the current block doesn't shift it out
	// 		if limit := uint64(len(snap.Signers)/2 + 1); seen > number-limit {
	// 			return errUnauthorized
	// 		}
	// 	}
	// }
	// Ensure that the difficulty corresponds to the turn-ness of the signer
	inturn := snap.inturn(header.Number.Uint64(), signer)
	if inturn && header.Difficulty.Cmp(diffInTurn) != 0 {
		return errInvalidDifficulty
	}
	if !inturn && header.Difficulty.Cmp(diffNoTurn) != 0 {
		return errInvalidDifficulty
	}
	return nil
}

func (c *PBFT) Seal(chain consensus.ChainReader, block *types.Block, stop <-chan struct{}) (*types.Block, error) {
	...
	//=> --Agzs
	// // If we're amongst the recent signers, wait for the next block
	// for seen, recent := range snap.Recents {
	// 	if recent == signer {
	// 		// Signer is among recents, only wait if the current block doesn't shift it out
	// 		if limit := uint64(len(snap.Signers)/2 + 1); number < limit || seen > number-limit {
	// 			log.Info("Signed recently, must wait for others")
	// 			<-stop
	// 			return nil, nil
	// 		}
	// 	}
	// }

	...
	// Sign all the things!
	sighash, err := signFn(accounts.Account{Address: signer}, sigHash(header, nil).Bytes()) //=> TODO. --Agzs
	...
	newBlock := block.WithSeal(header)
	
	c.pbft.recvRequestBlock(newBlock) 

	/// TODO: wait until the commit messages are received by 2/3 PBFT nodes, then return the block --Zhiguo
	select {
	case <-stop:
		log.Info("Stop PBFT algorithm!")
		return nil, nil
	case <-c.finishedChan:
		//c.pbft.lastExec = newBlock.Header().Number.Uint64() //=>TODO. lastExec. --Agzs
		return newBlock, nil
	}

	///        return nil, nil

}

func (s *Snapshot) apply(headers []*types.Header) (*Snapshot, error) {
	...
		// for _, recent := range snap.Recents {
		// 	if recent == signer {
		// 		return nil, errUnauthorized
		// 	}
		// }
	...
}

Because errUnauthorized is returned, the statements that follow never run, which causes this error.
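A worked example shows why the recents check is a poor fit when a single primary seals every block. The sketch below mirrors the condition from the commented-out loop in Seal(), with signer addresses simplified to strings; signedRecently is a name invented for this example. With four signers the limit is 4/2 + 1 = 3, so a signer that sealed block 1 is rejected for blocks 2 and 3 and only becomes eligible again at block 4.

```go
package main

import "fmt"

// signedRecently mirrors the clique-style check: a signer recorded in
// recents is blocked until the current block number shifts it out of the
// window of size len(signers)/2 + 1.
func signedRecently(recents map[uint64]string, signer string, number uint64, signerCount int) bool {
	limit := uint64(signerCount/2 + 1)
	for seen, recent := range recents {
		if recent == signer {
			if number < limit || seen > number-limit {
				return true
			}
		}
	}
	return false
}

func main() {
	// signer1 sealed block 1; there are four authorized signers.
	recents := map[uint64]string{1: "signer1"}
	for number := uint64(2); number <= 4; number++ {
		blocked := signedRecently(recents, "signer1", number, 4)
		fmt.Printf("block %d: signer1 blocked = %v\n", number, blocked)
	}
}
```

Under clique, different signers take turns, so this window is harmless. If the PBFT primary is the only node that seals, the same rule would reject every consecutive block it produces, which is consistent with the unauthorized errors seen here.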

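Solution

Comment out the for loops over snap.Recents that return errUnauthorized in the three places above, as shown.

With this fixed, signer1 can mine blocks continuously and the other signers can import them, as can any node added through addPeer().

Note:

It is recommended to delete the RocksDB db folder before each run. Errors are bound to occur during testing, and faulty preprepare and other messages get saved to pset and qset and from there to RocksDB. After a restart this data is read back in, which corrupts some seqNo values and prevents the program from running correctly. Be sure to keep this in mind!

10. Authorization problem

This was observed while problem 9 was still unresolved; the expectation was that fixing problem 9 would also fix this one.

After closing all terminals and restarting geth, a new problem appeared:

signer1 reports: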

INFO [11-06|10:57:37] Successfully sealed new block            number=2 hash=205067…39e5d8
INFO [11-06|10:57:37] 🔨 mined potential block                  number=2 hash=205067…39e5d8
ERROR[11-06|10:57:37] Failed to prepare header for mining      err=unauthorized

signer2 reports the following; the other signers are similar:

INFO [11-06|10:57:37] Imported new chain segment               blocks=1 txs=0 mgas=0.000 elapsed=220.297µs mgasps=0.000 number=2 hash=205067…39e5d8
ERROR[11-06|10:57:37] Failed to prepare header for mining      err=unauthorized

If everything were working, signer2 would import 2 blocks and signer1 would not report unauthorized.

For comparison, a normal clique test run:

signer1 output:

> miner.start()
INFO [11-06|11:10:10] Transaction pool price threshold updated price=18000000000
INFO [11-06|11:10:10] Starting mining operation 
null
> INFO [11-06|11:10:10] Commit new mining work                   number=22 txs=0 uncles=0 elapsed=113.333µs
INFO [11-06|11:10:10] Successfully sealed new block            number=22 hash=af7e9b…d34f0c
INFO [11-06|11:10:10] 🔨 mined potential block                  number=22 hash=af7e9b…d34f0c
INFO [11-06|11:10:10] Commit new mining work                   number=23 txs=0 uncles=0 elapsed=408.973µs
> admin.addPeer("enode://996058eac555decf228b31699553d28fb1f65fb116f6d98b18b9f42d4d65498a4f9db808ace34211e3b7fad627f028f6c7fdd39b745aa3cdca2ca367ac51b362@127.0.0.1:2000")
INFO [11-06|11:10:20] Successfully sealed new block            number=23 hash=0c3286…2bd8eb
INFO [11-06|11:10:20] 🔨 mined potential block                  number=23 hash=0c3286…2bd8eb
INFO [11-06|11:10:20] Commit new mining work                   number=24 txs=0 uncles=0 elapsed=391.673µs
INFO [11-06|11:10:30] Successfully sealed new block            number=24 hash=383202…0d2002
INFO [11-06|11:10:30] 🔨 mined potential block                  number=24 hash=383202…0d2002
INFO [11-06|11:10:30] Commit new mining work                   number=25 txs=0 uncles=0 elapsed=379.078µs

true

node1 output:

INFO [11-06|11:10:38] Imported new chain segment               blocks=4 txs=0 mgas=0.000 elapsed=1.844ms mgasps=0.000 number=24 hash=383202…0d2002 ignored=4
INFO [11-06|11:10:40] Read the next message from the remote peer in pm.handleMsg() 
INFO [11-06|11:10:40] handleMsg() ---NewBlockMsg----------- 
INFO [11-06|11:10:40] Read the next message from the remote peer in pm.handleMsg() 
INFO [11-06|11:10:40] handleMsg() ---NewBlockHashesMsg----------- 
INFO [11-06|11:10:40] Imported new chain segment               blocks=1 txs=0 mgas=0.000 elapsed=292.206µs mgasps=0.000 number=25 hash=c0e8f4…e9449d

Solution: once problem 9 is fixed, this problem no longer occurs.