[01] [Parse] lexer refactor bom handling - omochi/swift GitHub Wiki

PR draft

The current Lexer can skip a UTF-8 BOM, but that behavior has the following bugs.

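The Lexer skips a shebang at the beginning of the file, but not when a BOM is present.
A token at the beginning of the file that should be a prefix operator becomes an infix operator when a BOM is present.
The Lexer skips conflict markers, but not when a BOM is present.
A token at the beginning of the file should be isAtStartOfLine, but it is not when a BOM is present.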

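All of these bugs stem from the fact that the Lexer implementation treats BufferStart as the beginning of the source code even when a BOM has been skipped. In a text file with a BOM, the end of the BOM should be treated as the beginning of the text content.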

This PR fixes these problems and improves the handling of the UTF-8 BOM.

The design is as follows.

In a text file, the text content normally begins at the beginning of the file, but when a BOM is present it begins at the end of the BOM. To represent this, a field named TextContentStart is added.

Notes

The BOM + shebang bug exists in the shebang-skipping logic of both libParse and libSyntax. This PR fixes only libParse.

I think libSyntax should eventually keep the BOM as LeadingTrivia, but taking that into account would make the change larger, so I plan to submit it later as a separate PR.

The current Lexer can skip a UTF-8 BOM, but this behavior has some bugs.

It skips a shebang at the beginning of the file, but not when a BOM is present.
A token at the beginning of the file should be a prefix operator, but it becomes an infix operator when a BOM is present.
It skips conflict markers, but not when a BOM is present.
A token at the beginning of the file should be isAtStartOfLine, but it is not when a BOM is present.

All of these bugs come from the Lexer implementation treating BufferStart as the beginning of the source code even when a BOM has been skipped. In a text file with a BOM, the end of the BOM should be regarded as the beginning of the text content.
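
The root cause can be illustrated with a minimal sketch. The helper name startsWithShebang is hypothetical and is not part of the actual Lexer; the sketch only assumes that a shebang is recognized by the two bytes "#!" at the position being checked.

```cpp
#include <cassert>

// Hypothetical helper (not the real Lexer code): reports whether a shebang
// ("#!") begins exactly at Ptr. End points one past the last byte.
inline bool startsWithShebang(const char *Ptr, const char *End) {
  assert(Ptr <= End && "Ptr must lie inside the buffer");
  return End - Ptr >= 2 && Ptr[0] == '#' && Ptr[1] == '!';
}
```

When the check is made at BufferStart of a buffer that begins with the three BOM bytes, it sees 0xEF 0xBB and fails. Made at the first byte after the BOM, the same check succeeds.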

This PR fixes these issues.

The design is shown below.

In a text file, the text content ordinarily begins at the beginning of the file. But if there is a BOM, it begins at the end of the BOM.

To represent the beginning of the text content, a ContentStart field is added to Lexer.
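
As a rough sketch of how such a field could be computed, the following stand-alone code detects the three-byte UTF-8 BOM (EF BB BF) once and records where the content starts. BufferBounds and makeBounds are hypothetical names for illustration only; the real Lexer keeps these pointers as members and may compute them differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the relevant Lexer fields.
struct BufferBounds {
  const char *BufferStart;  // first byte of the file
  const char *BufferEnd;    // one past the last byte
  const char *ContentStart; // first byte after an optional UTF-8 BOM
};

// Builds the bounds for a buffer of Length bytes. ContentStart equals
// BufferStart unless the buffer begins with the bytes EF BB BF.
inline BufferBounds makeBounds(const char *Start, std::size_t Length) {
  assert(Start != nullptr && "buffer must not be null");
  const char *ContentStart = Start;
  if (Length >= 3 && std::memcmp(Start, "\xEF\xBB\xBF", 3) == 0)
    ContentStart = Start + 3;
  return {Start, Start + Length, ContentStart};
}
```

With this in place, start-of-file checks compare against ContentStart, while BufferStart keeps meaning the physical start of the file.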

Note

There are two BOM + shebang bugs, one in the skipping logic of libParse and one in that of libSyntax (Lexer::lexTrivia). This PR fixes only the libParse one and adds test cases for it.

I think that libSyntax should not skip the BOM and should keep it as LeadingTrivia. But implementing this idea needs more changes, so I plan to submit it as another PR in the future.

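Incidentally, SwiftSyntax should eventually keep the BOM as LeadingTrivia, because when editing source files it should be possible to control whether the BOM is preserved or removed. For SwiftSyntax to handle the BOM in the same way as comments and other trivia, CurPtr must point before the BOM when lexTrivia is entered. Therefore, the BOM-skipping step moves from the constructor to lexImpl. However, whether the file begins with a BOM is a static fact that does not depend on the Lexer's state, so it can still be determined in the constructor as before.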
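
A toy model may make this split clearer. ToyLexer below is hypothetical and far simpler than swift::Lexer: it assumes tokens are runs of bytes separated by spaces or newlines. It only shows the constructor recording ContentStart while leaving CurPtr at BufferStart, and the first lex call performing the skip.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Toy model, not the real swift::Lexer.
class ToyLexer {
  const char *BufferStart;
  const char *BufferEnd;
  const char *ContentStart; // fixed at construction
  const char *CurPtr;       // stays at BufferStart until the first lex()

public:
  ToyLexer(const char *Start, std::size_t Length)
      : BufferStart(Start), BufferEnd(Start + Length), ContentStart(Start),
        CurPtr(Start) {
    assert(Start != nullptr && "buffer must not be null");
    // Whether a BOM is present is a static property of the buffer,
    // so it is detected here, but CurPtr is not advanced yet.
    if (Length >= 3 && std::memcmp(Start, "\xEF\xBB\xBF", 3) == 0)
      ContentStart = Start + 3;
  }

  const char *getCurPtr() const { return CurPtr; }
  bool hasBOM() const { return ContentStart != BufferStart; }

  // Returns the next run of non-whitespace bytes, or "" at end of buffer.
  std::string lex() {
    // The BOM is skipped lazily, on the first call only.
    if (CurPtr == BufferStart)
      CurPtr = ContentStart;
    while (CurPtr != BufferEnd && (*CurPtr == ' ' || *CurPtr == '\n'))
      ++CurPtr;
    const char *TokStart = CurPtr;
    while (CurPtr != BufferEnd && *CurPtr != ' ' && *CurPtr != '\n')
      ++CurPtr;
    return std::string(TokStart, CurPtr);
  }
};
```

Because CurPtr still points at the BOM before the first lex call, a trivia-collecting step run at that point could see the BOM bytes and record them instead of silently dropping them.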

Note: keeping the BOM in SwiftSyntax will be done in a separate PR.

  NextToken.setAtStartOfLine(CurPtr == BufferStart);

  // Remember where we started so that we can find the comment range.
  if (CurPtr == BufferStart) {
    // AttachToNextToken mode does not take BOM as comment.
    LastCommentBlockStart = ContentStart;
  } else {
    LastCommentBlockStart = CurPtr;
  }
  SeenComment = false;

Restart:
  lexTrivia(LeadingTrivia, /* IsForTrailingTrivia */ false);

  // lexTrivia also skips the BOM,
  // but skipping it twice is harmless.
  if (CurPtr == BufferStart) {
    // skip UTF-8 BOM
    CurPtr = ContentStart;
  }


TEST_F(LexerTest, ContentStartHashbangSkipTrivia) {
  using namespace swift::syntax;
  
  const char *Source = "#!/usr/bin/swift\naaa";
  
  LangOptions LangOpts;
  SourceManager SourceMgr;
  unsigned BufferID = SourceMgr.addMemBufferCopy(StringRef(Source));
  
  Lexer L(LangOpts, SourceMgr, BufferID, /*Diags=*/nullptr, /*InSILMode=*/false,
          CommentRetentionMode::AttachToNextToken, TriviaRetentionMode::WithTrivia);
  
  Token Tok;
  Trivia LeadingTrivia, TrailingTrivia;
  
  L.lex(Tok, LeadingTrivia, TrailingTrivia);
  ASSERT_EQ(LeadingTrivia,
            (Trivia{{ TriviaPiece::garbageText("#!/usr/bin/swift"),
                      TriviaPiece::newlines(1) }}));
  ASSERT_EQ(Tok.getKind(), tok::identifier);
  ASSERT_EQ(Tok.getText(), "aaa");
}

TEST_F(LexerTest, ContentStartHashbangSkipTriviaUTF8BOM) {
  using namespace swift::syntax;
  
  const char *Source = "\xEF\xBB\xBF" "#!/usr/bin/swift\naaa";
  
  LangOptions LangOpts;
  SourceManager SourceMgr;
  unsigned BufferID = SourceMgr.addMemBufferCopy(StringRef(Source));
  
  Lexer L(LangOpts, SourceMgr, BufferID, /*Diags=*/nullptr, /*InSILMode=*/false,
          CommentRetentionMode::AttachToNextToken, TriviaRetentionMode::WithTrivia);
  
  Token Tok;
  Trivia LeadingTrivia, TrailingTrivia;
  
  L.lex(Tok, LeadingTrivia, TrailingTrivia);
  ASSERT_EQ(LeadingTrivia,
            (Trivia{{ TriviaPiece::garbageText("\xEF\xBB\xBF" "#!/usr/bin/swift"),
                      TriviaPiece::newlines(1) }}));
  ASSERT_EQ(Tok.getKind(), tok::identifier);
  ASSERT_EQ(Tok.getText(), "aaa");
}