[Elasticsearch] Should autocomplete treat spaces as significant? - forewalk/elastic GitHub Wiki

Elasticsearch

Autocomplete

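There are several ways to build autocomplete. The n-gram approach introduced earlier works, and so does the suggest API. Each has its own trade-offs, so pick whichever fits your site. This time the topic is not how to build autocomplete, but how to handle input that contains spaces. Suppose the indexed data includes values with spaces, such as the address "경기도 파주시 야당동" (Gyeonggi-do Paju-si Yadang-dong). Google and Naver take different approaches to autocompleting this kind of data.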

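Google: typing "경기도" produces suggestions, but if you skip the space and type "경기도파", the suggestions disappear. This means Google's autocomplete treats the space as an ordinary character of the string. Naver behaves differently.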

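Naver: typing "경기도" produces suggestions, and skipping the space to type "경기도파" still returns "경기도 파주시 야당동". Rather than indexing the source data split into several pieces (although that would also be possible), this appears to be done by an analyzer that joins the tokens on either side of the space.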

This looks like a difference in perspective between the two sites: Google seems to focus on precision, Naver on user-friendliness.
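The two behaviors described above can be modeled in a few lines of plain Python. This is only a sketch of the observable difference, not how either site is implemented. The function names and the second sample address are made up for illustration.

```python
from typing import List


def strip_spaces(text: str) -> str:
    """Remove all whitespace so spaced and unspaced input compare equal."""
    return "".join(text.split())


def suggest_space_significant(prefix: str, candidates: List[str]) -> List[str]:
    """The space is an ordinary character: the prefix must match literally."""
    return [c for c in candidates if c.startswith(prefix)]


def suggest_space_insensitive(prefix: str, candidates: List[str]) -> List[str]:
    """Whitespace is dropped from both sides before the prefix comparison."""
    key = strip_spaces(prefix)
    return [c for c in candidates if strip_spaces(c).startswith(key)]


# Sample data: the first address is from the text, the second is invented.
ADDRESSES = ["경기도 파주시 야당동", "경기도 고양시 일산동구"]

print(suggest_space_significant("경기도", ADDRESSES))    # both addresses
print(suggest_space_significant("경기도파", ADDRESSES))  # nothing matches
print(suggest_space_insensitive("경기도파", ADDRESSES))  # the Paju address
```

The space-insensitive variant always returns the original, spaced value, which matches what the user sees in the Naver-style behavior.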

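The same handling is possible in ES. Among the token filters there is one called the shingle filter (named after roof shingles, which is amusing); it defines, inside the analyzer, how neighboring tokens of the data are combined. See the Elasticsearch documentation for the shingle token filter.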
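To make the idea concrete, here is a rough pure-Python model of what a shingle filter with an empty separator emits. It is a simplification written for this page, not Elasticsearch's implementation: token order and positions are not reproduced faithfully, and a whitespace split stands in for the tokenizer. The defaults mirror the settings used in the index definition on this page (unigrams kept, sizes 2 to 5, empty separator).

```python
from typing import List


def shingle_terms(tokens: List[str], min_size: int = 2, max_size: int = 5,
                  separator: str = "", output_unigrams: bool = True) -> List[str]:
    """Join every run of min_size..max_size adjacent tokens with the separator."""
    out: List[str] = []
    for start in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[start])
        for size in range(min_size, max_size + 1):
            if start + size > len(tokens):
                break  # not enough tokens left for a shingle of this size
            out.append(separator.join(tokens[start:start + size]))
    return out


tokens = "경기도 파주시 야당동".split()  # stand-in for the tokenizer step
terms = shingle_terms(tokens)
print(terms)

# Terms that a space-free prefix such as "경기도파" can now match.
print([t for t in terms if t.startswith("경기도파")])
```

Because the joined terms contain no space, a prefix typed without a space has something to match against, while the single-token terms still serve ordinary input.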

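If you would rather handle this at index time through the mapping, the index_phrases mapping parameter is another option. See the Elasticsearch documentation for index_phrases.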
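As a minimal sketch, a mapping that enables index_phrases on a text field could look like the body built below. The index name `address-demo` and the field name `address` are hypothetical and do not come from this page. The body is only printed in console form here; sending it is left to whatever HTTP client you use. As far as the Elasticsearch documentation describes it, index_phrases indexes two-term shingles into a separate field to speed up exact phrase queries, so it is worth checking whether it covers your autocomplete case before relying on it.

```python
import json

# Hypothetical index name, used only for illustration.
INDEX_NAME = "address-demo"

mapping_body = {
    "mappings": {
        "properties": {
            # Hypothetical text field holding values such as addresses.
            "address": {
                "type": "text",
                # Index two-term word combinations into a separate field.
                "index_phrases": True,
            }
        }
    }
}

# Print the request in the console form used elsewhere on this page.
print(f"PUT {INDEX_NAME}")
print(json.dumps(mapping_body, ensure_ascii=False, indent=2))
```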

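How should spaces in the input be handled? It is a simple question, but one that can easily come up as a customer requirement.

The index definition below is one example: complete_analyzer, which backs the docTitle.completion field, applies a shingle filter with an empty token_separator.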

PUT dmap_map_nation_datamap_open_20200108
{
    "mappings" : {
      "properties" : {
        "@timestamp" : {
          "type" : "date"
        },
        "@version" : {
          "type" : "keyword"
        },
        "brmCls1Nm" : {
          "type" : "keyword"
        },
        "brmCls2Nm" : {
          "type" : "keyword"
        },
        "dataNodeId" : {
          "type" : "keyword"
        },
        "docContent" : {
          "type" : "text",
          "analyzer" : "nori_analyzer",
          "fielddata" : true
        },
        "docId" : {
          "type" : "keyword"
        },
        "docTitle" : {
          "type" : "text",
          "fields" : {
            "completion" : {
              "type" : "completion",
              "analyzer" : "complete_analyzer",
              "preserve_separators" : true,
              "preserve_position_increments" : true,
              "max_input_length" : 50
            }
          },
          "analyzer" : "nori_analyzer",
          "fielddata" : true
        },
        "holdColList" : {
          "type" : "text"
        },
        "holdColListArray" : {
          "type" : "keyword"
        },
        "lodDt" : {
          "type" : "date"
        },
        "openDataYn" : {
          "type" : "keyword"
        },
        "openHoldColList" : {
          "type" : "text"
        },
        "openHoldColListArray" : {
          "type" : "keyword",
          "fields" : {
            "chosung" : {
              "type" : "text",
              "analyzer" : "chosung_index_analyzer",
              "search_analyzer" : "chosung_search_analyzer"
            },
            "eng2kor" : {
              "type" : "text",
              "analyzer" : "standard",
              "search_analyzer" : "eng2kor_analyzer"
            },
            "nori" : {
              "type" : "text",
              "analyzer" : "nori_analyzer"
            }
          }
        },
        "orgCd" : {
          "type" : "keyword"
        },
        "orgNm" : {
          "type" : "keyword"
        },
        "orgNmArray" : {
          "type" : "keyword",
          "fields" : {
            "chosung" : {
              "type" : "text",
              "analyzer" : "chosung_index_analyzer",
              "search_analyzer" : "chosung_search_analyzer"
            },
            "eng2kor" : {
              "type" : "text",
              "analyzer" : "standard",
              "search_analyzer" : "eng2kor_analyzer"
            },
            "nori" : {
              "type" : "text",
              "analyzer" : "nori_analyzer"
            }
          }
        },
        "searchKeyword" : {
          "type" : "keyword"
        },
        "searchKeywordArray" : {
          "type" : "keyword",
          "fields" : {
            "chosung" : {
              "type" : "text",
              "analyzer" : "chosung_index_analyzer",
              "search_analyzer" : "chosung_search_analyzer"
            },
            "eng2kor" : {
              "type" : "text",
              "analyzer" : "standard",
              "search_analyzer" : "eng2kor_analyzer"
            },
            "nori" : {
              "type" : "text",
              "analyzer" : "nori_analyzer"
            }
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "max_ngram_diff" : "20",
        "max_shingle_diff": "5",
        "analysis" : {
          "filter" : {
            "edge_ngram_front" : {
              "min_gram" : "1",
              "side" : "front",
              "type" : "edgeNGram",
              "max_gram" : "20"
            },
            "synonym" : {
              "type" : "synonym",
              "synonyms_path" : "dictionary/synonyms.txt"
            },
            "stop" : {
              "type" : "stop",
              "stopwords_path" : "dictionary/stopwords.txt"
            },
            "shingle":{
              "type": "shingle",
              "output_unigrams":true,
              "token_separator":"",
              "max_shingle_size":5
            },
            "chosung" : {
              "type" : "javacafe_chosung"
            },
            "nori_stop" : {
              "type" : "nori_part_of_speech",
              "stoptags" : [
                "E",
                "IC",
                "J",
                "MAG",
                "MM",
                "NR",
                "SF",
                "SH",
                "SP",
                "SL",
                "SN",
                "SSC",
                "SSO",
                "SC",
                "SY",
                "UNKNOWN",
                "SE",
                "XPN",
                "XSN",
                "VA",
                "VCN",
                "VCP",
                "VV",
                "VX",
                "XPN",
                "XR",
                "XSA",
                "XSV",
                "UNA",
                "NA",
                "VSV"
              ]
            }
          },
          "analyzer" : {
            "chosung_index_analyzer" : {
              "filter" : [
                "chosung",
                "lowercase",
                "trim",
                "edge_ngram_front"
              ],
              "type" : "custom",
              "tokenizer" : "keyword"
            },
            "eng2kor_analyzer" : {
              "filter" : [
                "trim",
                "lowercase",
                "javacafe_eng2kor",
                "synonym",
                "stop"
              ],
              "type" : "custom",
              "tokenizer" : "standard"
            },
            "nori_analyzer" : {
              "filter" : [
                "nori_stop",
                "lowercase",
                "trim",
                "stop",
                "synonym"
              ],
              "type" : "custom",
              "tokenizer" : "nori_tokenizer"
            },
            "chosung_search_analyzer" : {
              "filter" : [
                "chosung",
                "lowercase",
                "trim"
              ],
              "type" : "custom",
              "tokenizer" : "keyword"
            },
            "complete_analyzer" : {
              "filter" : [
                "trim",
                "lowercase",
                "shingle"
              ],
              "type" : "custom",
              "tokenizer" : "letter"
            }
          },
          "tokenizer" : {
            "ngram_tokenizer" : {
              "token_chars" : [
                "letter",
                "digit",
                "punctuation",
                "symbol"
              ],
              "min_gram" : "1",
              "type" : "ngram",
              "max_gram" : "20"
            },
            "nori_tokenizer" : {
              "mode" : "MIXED",
              "type" : "nori_tokenizer",
              "user_dictionary" : "dictionary/userdic.txt"
            }
          }
        }
      }
    }
}
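To check how this index behaves for space-free input, two request bodies are useful: one for the _analyze API, to see which terms complete_analyzer produces, and one for the completion suggester on docTitle.completion. The sketch below only builds and prints them. The suggestion label `title-suggest` and the `size` value are arbitrary choices for illustration, and whether "경기도파" actually returns a hit depends on how the analyzer and the preserve_separators setting interact in your Elasticsearch version, so verify it against a real index.

```python
import json

INDEX = "dmap_map_nation_datamap_open_20200108"  # index name from this page


def analyze_request(text: str, analyzer: str = "complete_analyzer") -> dict:
    """Body for GET <index>/_analyze: shows the terms an analyzer emits."""
    return {"analyzer": analyzer, "text": text}


def completion_request(prefix: str, field: str = "docTitle.completion",
                       size: int = 5) -> dict:
    """Body for POST <index>/_search using the completion suggester."""
    return {
        "suggest": {
            "title-suggest": {  # arbitrary label for this suggestion
                "prefix": prefix,
                "completion": {"field": field, "size": size},
            }
        }
    }


analyze_body = analyze_request("경기도 파주시 야당동")
suggest_body = completion_request("경기도파")

print(f"GET {INDEX}/_analyze")
print(json.dumps(analyze_body, ensure_ascii=False, indent=2))
print(f"POST {INDEX}/_search")
print(json.dumps(suggest_body, ensure_ascii=False, indent=2))
```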