A User Centered Concept Mining System for Query and Document Understanding at Tencent
Summary
Most prior work extracts formal, overly general concepts from Wikipedia or static web pages, which do not reflect the user's perspective. This paper describes the authors' experience implementing and deploying ConcepT in Tencent QQ Browser. ConcepT discovers user-centered concepts at a granularity that matches user interests by mining a large volume of user queries and search click logs, so it can infer user intent from how users interact with content.
ConcepT extracts candidate user-centered concepts from large query logs using two unsupervised strategies: (1) a small number of predefined string patterns is used to find new concepts, and the concepts found are in turn used to expand the pool of patterns; (2) an important concept in a query tends to reappear in the titles of documents clicked by the user who issued the query.
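The two strategies above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `<C>` slot marker, the function names, and the example queries are all hypothetical, patterns are assumed to cover the whole query, and the alignment step is simplified to "longest contiguous run of query tokens that also appears in the clicked title". A real system would also filter low-quality patterns and candidates, which is omitted here.

```python
import re

SLOT = "<C>"  # hypothetical marker for the concept position in a pattern


def pattern_to_regex(pattern):
    """Turn a pattern such as 'top 10 <C>' into a compiled regex."""
    prefix, suffix = pattern.split(SLOT)
    return re.compile(re.escape(prefix) + r"(.+)" + re.escape(suffix))


def extract_with_patterns(queries, patterns):
    """Apply every pattern to every query and collect the slot fillers."""
    found = set()
    for pattern in patterns:
        rx = pattern_to_regex(pattern)
        for query in queries:
            match = rx.fullmatch(query)
            if match:
                found.add(match.group(1).strip())
    return found


def induce_patterns(queries, concepts):
    """Derive new patterns from queries that contain a known concept."""
    induced = set()
    for query in queries:
        for concept in concepts:
            if concept in query and query != concept:
                induced.add(query.replace(concept, SLOT, 1))
    return induced


def bootstrap(queries, seed_patterns, rounds=2):
    """Alternate between finding concepts and expanding the pattern pool."""
    patterns, concepts = set(seed_patterns), set()
    for _ in range(rounds):
        new_concepts = extract_with_patterns(queries, patterns) - concepts
        if not new_concepts:
            break
        concepts |= new_concepts
        patterns |= induce_patterns(queries, concepts)
    return concepts, patterns


def align_query_title(query, title):
    """Return the longest contiguous run of query tokens found in the title."""
    q, t = query.split(), title.split()
    best = []
    for i in range(len(q)):
        for j in range(i + 1, len(q) + 1):
            span = q[i:j]
            n = len(span)
            if n > len(best) and any(t[k:k + n] == span for k in range(len(t) - n + 1)):
                best = span
    return " ".join(best)


queries = [
    "top 10 fuel-efficient cars",
    "best fuel-efficient cars",
    "best budget phones",
]
concepts, patterns = bootstrap(queries, ["top 10 " + SLOT])
candidate = align_query_title(
    "what are fuel-efficient cars",
    "ten fuel-efficient cars to buy in 2019",
)
```

In this toy run the seed pattern finds one concept, that concept induces the pattern `best <C>`, and the new pattern finds a second concept in the next round.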
The authors train a supervised sequence-labeling model (a Conditional Random Field) and a discriminator on the initial seed concept set to generalize concept extraction and control concept quality.
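To make the sequence-labeling step more concrete, the sketch below shows how per-token labels from such a model could be turned back into concept strings. It does not train a CRF; the B/I/O label scheme, the function name, and the example query are assumptions for illustration, and the summary does not say which label scheme the authors use.

```python
def decode_bio(tokens, labels):
    """Collect token spans labeled B (begin) followed by I (inside)."""
    spans, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B":
            # A new concept starts; close any concept still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif label == "I" and current:
            current.append(token)
        else:
            # "O", or a stray "I" with no open concept, ends the span.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


tokens = ["best", "fuel-efficient", "cars", "in", "2019"]
labels = ["O", "B", "I", "O", "O"]  # hypothetical model output
extracted = decode_bio(tokens, labels)
```

A discriminator would then score each extracted span and drop the low-quality ones; that step is not shown.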
The authors propose strategies for tagging documents with potentially complex concepts to describe what a document covers, using two main methods: (1) matching key instances in a document to their concepts when an isA relationship exists in the constructed taxonomy; (2) using a probabilistic inference framework to estimate the probability of a concept given that an instance is observed in its context.
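Both tagging methods can be illustrated with a small sketch. The first function is a plain lookup in an isA mapping. The second assumes, for illustration only, that the probability of a concept given an instance factors through the context words seen around the instance, p(c | e) = Σ_x p(c | x) · p(x | e), with p(x | e) estimated from co-occurrence counts and p(c | x) spread uniformly over the concepts that contain x. The function names, the factorization details, and the data are assumptions, not the authors' exact formulation.

```python
from collections import Counter


def tag_by_taxonomy(key_instances, is_a):
    """Method 1: tag with every concept a key instance isA member of."""
    tags = set()
    for instance in key_instances:
        tags.update(is_a.get(instance, ()))
    return tags


def concept_given_instance(context_counts, concepts):
    """Method 2: score concepts via context words around an instance.

    context_counts maps each context word x to how often it co-occurs
    with the instance, so p(x | e) = count / total.
    """
    total = sum(context_counts.values())
    scores = {}
    for word, count in context_counts.items():
        p_word = count / total
        # Assumed p(c | x): uniform over concepts containing the word.
        matching = [c for c in concepts if word in c.split()]
        for concept in matching:
            scores[concept] = scores.get(concept, 0.0) + p_word / len(matching)
    return scores


is_a = {"Toyota Prius": ["fuel-efficient cars"]}  # hypothetical taxonomy slice
tags = tag_by_taxonomy(["Toyota Prius", "unknown model"], is_a)

context = Counter({"fuel-efficient": 3, "cars": 1})  # hypothetical counts
scores = concept_given_instance(context, ["fuel-efficient cars", "luxury cars"])
```

With these counts, "fuel-efficient cars" scores higher than "luxury cars" because the more frequent context word matches only the former.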
The authors have constructed and maintain a three-layer topic-concept-instance taxonomy by identifying isA relationships among instances, concepts, and topics with machine learning methods such as deep neural networks and probabilistic models. The taxonomy supports query and document understanding at varying granularities.
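One simple way to picture the three-layer taxonomy is as a graph of isA edges that can be walked upward from an instance to its concepts and topics. The class below is a hypothetical in-memory sketch with made-up entries; it says nothing about how the edges are learned or how the production system stores them.

```python
from collections import defaultdict


class Taxonomy:
    """Stores isA edges: instance -> concept -> topic."""

    def __init__(self):
        self.parents = defaultdict(set)

    def add_is_a(self, child, parent):
        """Record that child isA parent."""
        self.parents[child].add(parent)

    def ancestors(self, node):
        """Return everything reachable from node by following isA edges."""
        seen, stack = set(), [node]
        while stack:
            for parent in self.parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


taxonomy = Taxonomy()
taxonomy.add_is_a("Toyota Prius", "fuel-efficient cars")  # instance -> concept
taxonomy.add_is_a("fuel-efficient cars", "automobiles")   # concept -> topic
coverage = taxonomy.ancestors("Toyota Prius")
```

Walking upward like this is what lets a query or document be described at the instance, concept, or topic level as needed.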