Bookmarks - aragorn/home GitHub Wiki

Natural Language Processing

databases

The time series database service
- https://tempo-db.com/
- TempoDB is purpose-built to store & analyze time series data from sensors, smart meters, servers & more

In-house search

http://cp.news.search.daum.net/partner/guide_qna News CP partnership
http://cp.news.search.daum.net/partner/guide_enterance
http://syndi-guide.search.daum.net:3000/ Video partnership

Unix & Programming General

http://www.linusakesson.net/programming/tty/
How TTYs, processes/jobs/sessions, and signals relate to one another
http://gcc.godbolt.org/
A web page that compiles C/C++ code online and shows the resulting assembly
Sorting Algorithm Animations http://www.sorting-algorithms.com/
http://offbytwo.com/2011/06/26/things-you-didnt-know-about-xargs.html
http://mark.stosberg.com/blog/2010/12/percent-encoding-uris-in-perl.html
Broadly, there are two approaches, one using CGI.pm and one using URI::Escape, and neither is perfect.
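For comparison with the Perl approaches discussed in the linked post, here is a minimal Python sketch of percent-encoding a single URI component. It uses only the standard library's `urllib.parse`; the helper name `encode_component` and the sample string are made up for illustration.

```python
from urllib.parse import quote, unquote


def encode_component(text: str) -> str:
    """Percent-encode one URI component (hypothetical helper)."""
    # quote() leaves "/" unescaped by default; safe="" escapes it too,
    # so only letters, digits, and "_.-~" pass through unchanged.
    return quote(text, safe="")


encoded = encode_component("a b&c/d=é")
print(encoded)           # the UTF-8 bytes of each escaped character as %XX
print(unquote(encoded))  # decodes back to the original string
```

Encoding a whole URL in one call would also escape the `/` and `?` delimiters, so this is meant to be applied per component.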

Data Visualization

http://mbostock.github.io/protovis/
http://www.colorzilla.com/gradient-editor/
Why Should Engineers and Scientists Be Worried About Color?
http://www.research.ibm.com/people/l/lloydt/color/color.HTM

hash library

Benchmarks and etc

Hash functions: An empirical comparison http://www.strchr.com/hash_functions
Meiyan http://www.sanmayce.com/Fastest_Hash/
Benchmarking CRC32 and PopCnt instructions http://www.strchr.com/crc32_popcnt
xxhash256 http://extrememoderate.wordpress.com/2012/06/10/xxhash256-update/
CRC16 - CRC64 test results on 18.2M dataset, w/program source http://www.backplane.com/matt/crc64.html
- Program & Test Run by Matt Dillon
- 18.2M message-id dataset supplied by Joe Greco
This article only discusses how to write a fast CRC32 algorithm in C/C++. http://create.stephan-brumme.com/crc32/
http://developers.blog.box.com/2011/10/12/crc32-checksums-the-good-the-bad-and-the-ugly/
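As a reference point for the CRC32 links above, the following is a minimal table-driven sketch of the standard reflected CRC-32 (polynomial `0xEDB88320`) in Python. It is not the code from any of the linked articles and is not tuned for speed; it is checked against the standard library's `zlib.crc32`.

```python
import zlib


def _build_table():
    """Precompute the CRC of every possible byte value."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            # Shift right one bit; XOR in the polynomial if a 1 fell off.
            crc = (crc >> 1) ^ 0xEDB88320 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc32(data: bytes) -> int:
    """Table-driven CRC-32, one table lookup per input byte."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = (crc >> 8) ^ _TABLE[(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF


sample = b"123456789"
print(hex(crc32(sample)), hex(zlib.crc32(sample)))
```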

SMHasher & MurmurHash v3

https://code.google.com/p/smhasher/

MurmurHash v2

https://sites.google.com/site/murmurhash/

- Extremely simple - compiles down to ~52 instructions on x86.
- Excellent distribution - Passes chi-squared tests for practically all keysets & bucket sizes.
- Excellent avalanche behavior - Maximum bias is under 0.5%.
- Excellent collision resistance - Passes Bob Jenkins' frog.c torture-test. No collisions possible for 4-byte keys, no small (1- to 7-bit) differentials.
- Excellent performance - measured on an Intel Core 2 Duo @ 2.4 GHz:

OneAtATime - 354.163715 mb/sec

FNV - 443.668038 mb/sec

SuperFastHash - 985.335173 mb/sec

lookup3 - 988.080652 mb/sec

MurmurHash 1.0 - 1363.293480 mb/sec

MurmurHash 2.0 - 2056.885653 mb/sec

jQuery

menu-aim is a jQuery plugin for dropdown menus that can differentiate between a user trying to hover over a dropdown item vs trying to navigate into a submenu's contents.
- https://github.com/kamens/jQuery-menu-aim
- http://story.pxd.co.kr/655

Mathematics

Partially ordered set, poset https://en.wikipedia.org/wiki/Partially_ordered_set
- Using a partially ordered set, the creation date of a web document can be estimated.
http://en.wikipedia.org/wiki/Learning_to_rank
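The note on partially ordered sets does not say how the date estimate would work, so the following is only one possible reading, sketched under stated assumptions: some documents have known dates, and we are given pairs `(a, b)` meaning "a was created no later than b". Following those pairs transitively bounds an undated document from both sides. The function names and sample data are hypothetical.

```python
from datetime import date


def _reachable(start, edges):
    """All nodes reachable from start by following edges (start excluded)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def date_bounds(doc, known, not_after):
    """Return (earliest, latest) possible creation date for doc.

    known: dict of document -> known date (assumption: these are trusted).
    not_after: iterable of (a, b) pairs, "a was created no later than b".
    A bound is None when the partial order says nothing in that direction.
    """
    succ, pred = {}, {}
    for a, b in not_after:
        succ.setdefault(a, []).append(b)
        pred.setdefault(b, []).append(a)
    earlier = [known[d] for d in _reachable(doc, pred) if d in known]
    later = [known[d] for d in _reachable(doc, succ) if d in known]
    return max(earlier, default=None), min(later, default=None)


known = {"a": date(2013, 1, 10), "d": date(2013, 3, 1)}
order = [("a", "b"), ("b", "c"), ("c", "d")]
print(date_bounds("c", known, order))
```

Documents that are incomparable to every dated document get `(None, None)`, which is exactly the situation a partial (rather than total) order allows.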

Tips

Chrome Record modes and Playback modes
- Record modes let you record every request Chrome makes. Playback mode serves requests out of that recorded cache just as if they were being loaded on the spot. It doesn't record where you click or what you open, just every request as it moves over the wire. http://dev.hubspot.com/blog/bulletproof-demos
What should every programmer know about web development?
- http://programmers.stackexchange.com/questions/46716/what-should-every-programmer-know-about-web-development/46760#46760

Benchmarks

HTTP Client Performance – IO http://blogs.atlassian.com/2013/07/http-client-performance-io/

HTTP

Web Development

SEO - Search Engine Optimization

Making AJAX Applications Crawlable
- Hash fragment - https://developers.google.com/webmasters/ajax-crawling/docs/specification?csw=1

Traditionally, hash fragments (that is, everything after # in the URL) have been used to indicate one portion of a static HTML document. By contrast, AJAX applications often use hash fragments in another function, namely to indicate state. For example, when a user navigates to the URL http://www.example.com/ajax.html#key1=value1&key2=value2, the AJAX application will parse the hash fragment and move the application to the "key1=value1&key2=value2" state. This is similar in spirit to moving to a portion of a static document, that is, the traditional use of hash fragments. History (the back button) in AJAX applications is generally handled with these hash fragments as well. Why are hash fragments used in this way? While the same effect could often be achieved with query parameters (for example, ?key1=value1&key2=value2), hash fragments have the advantage that in and of themselves, they do not incur an HTTP request and thus no round-trip from the browser to the server and back. In other words, when navigating from www.example.com/ajax.html to www.example.com/ajax.html#key1=value1&key2=value2, the web application moves to the state key1=value1&key2=value2 without a full page refresh. As such, hash fragments are an important tool in making AJAX applications fast and responsive. Importantly, however, hash fragments are not part of HTTP requests (and as a result they are not sent to the server), which is why our approach must handle them in a new way. See RFC 3986 for more details on hash fragments.
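The split described above can be seen with Python's standard `urllib.parse`, using the example URL from the quoted passage. This is a small sketch: the fragment carries the application state, while the part a browser would put in the HTTP request contains no trace of it. The variable names are illustrative.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://www.example.com/ajax.html#key1=value1&key2=value2"
parts = urlsplit(url)

# Application state lives entirely in the fragment.
state = parse_qs(parts.fragment)

# What would appear in the HTTP request: path plus query, never the fragment.
request_target = parts.path + ("?" + parts.query if parts.query else "")

print(state)
print(request_target)
```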

contents syndication protocol

sitemaps - http://www.sitemaps.org/protocol.html
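If you want a site's content to be exposed in search, the sitemaps protocol lets you export the full list of content. You can create multiple sitemaps files, and you can also use the ping protocol to signal to a search engine that your sitemaps have been updated. It seems better than the Naver content syndication protocol. --김정겸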
naver syndication api - http://dev.naver.com/openapi/apis/function/syndication
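To make the sitemaps note concrete, here is a minimal sketch that builds a sitemap document with Python's standard `xml.etree.ElementTree`. The `urlset`, `url`, `loc`, and `lastmod` elements and the namespace come from the sitemaps.org protocol; the function name and the example URLs are made up.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a sitemap string from (loc, lastmod-or-None) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:  # lastmod is optional in the protocol
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


xml_text = build_sitemap([
    ("http://www.example.com/", "2013-08-22"),
    ("http://www.example.com/about", None),
])
print(xml_text)
```

A real file would also start with an XML declaration and be served as UTF-8; the sketch leaves that out to stay short.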

Web search and crawler

Lectures

http://cs.nyu.edu/courses/spring11/G22.2580-001/

We will discuss the design of a Web search engine and the extraction of information off the Web. Topics include:
- Web crawlers
- Database design
- Query language
- Relevance ranking
- Document similarity and clustering
- The "invisible" Web
- Specialized search engines
- Evaluation
- Natural language processing
- The structure of the web
- Web content mining
- Web usage mining
- Business model: pricing, advertising
- Multi-media retrieval
- Multilingual retrieval

Textbooks

Mining of Massive Datasets by Anand Rajaraman (@anand_raj) and Jeff Ullman
- http://infolab.stanford.edu/~ullman/mmds.html
- Finding Similar Items http://infolab.stanford.edu/~ullman/mmds/ch3.pdf

Wikipedia

http://en.wikipedia.org/wiki/Distributed_web_crawling

For static assignment, a hashing function can be used to transform URLs (or, even better, complete website names) into a number that corresponds to the index of the corresponding crawling process. As there are external links that will go from a Web site assigned to one crawling process to a website assigned to a different crawling process, some exchange of URLs must occur. To reduce the overhead due to the exchange of URLs between crawling processes, the exchange should be done in batch, several URLs at a time, and the most cited URLs in the collection should be known by all crawling processes before the crawl (e.g.: using data from a previous crawl) (Cho and Garcia-Molina, 2002).
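The static assignment described in the quote can be sketched in a few lines of Python. The sketch hashes the host name rather than the full URL, as the quote suggests, so every page of a site goes to the same crawling process. The function name is hypothetical, and SHA-256 is just a convenient stable hash, not something the article prescribes.

```python
import hashlib
from urllib.parse import urlsplit


def assign_crawler(url: str, num_crawlers: int) -> int:
    """Map a URL to a crawler index in [0, num_crawlers) by hashing its host."""
    host = urlsplit(url).hostname or ""
    # Python's built-in hash() is randomized per process for strings, so a
    # fixed digest is used to keep assignments stable across machines and runs.
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers


for u in ("http://www.example.com/a", "http://www.example.com/b?x=1",
          "http://example.org/"):
    print(u, "->", assign_crawler(u, 4))
```

URLs whose index differs from the current process are the ones that would be batched up and exchanged with the other crawlers.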

Cho98, Efficient Crawling Through URL Ordering

Cho, J.; Garcia-Molina, H.; Page, L. (1998-04). "Efficient Crawling Through URL Ordering"
- http://ilpubs.stanford.edu:8090/347/1/1998-51.pdf
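The paper introduces three broad strategies.
- Decide on a topic and keywords in advance, then give priority to pages whose content scores highly against those keywords. Matches in anchor text get a particularly high weight. In effect, this computes the search ranking score on the spot and fetches first the documents that would rank near the top, so it can partially replace collecting another search service's results. It is usable when a specific topic must be crawled first.
- Compute the backlink count and give priority to pages with a high backlink count.
- Compute PageRank and give priority to pages with a high PageRank.
For the performance evaluation, the metric is how quickly the crawler reaches hot pages with high PageRank, compared with random crawling and breadth-first crawling. Because a hot page is defined as a document with high PageRank, the evaluation is effectively measuring whether the crawler quickly fetches the very pages it chose to prioritize. That seems like it should obviously be the case.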
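The backlink-count strategy from the list above can be sketched on a toy link graph. This is a simplified illustration, not the paper's algorithm or experimental setup: at each step it visits the frontier URL with the most backlinks seen so far, breaking ties alphabetically. The graph, page names, and function name are made up.

```python
from collections import Counter


def crawl_order(graph, seed):
    """Return the visit order when the frontier is ranked by backlink count.

    graph: dict mapping a page to the list of pages it links to.
    """
    backlinks = Counter()  # links discovered so far that point at each page
    frontier = {seed}
    visited = []
    while frontier:
        # Highest backlink count first; alphabetical order breaks ties.
        page = min(frontier, key=lambda u: (-backlinks[u], u))
        frontier.remove(page)
        visited.append(page)
        for target in graph.get(page, []):
            backlinks[target] += 1
            if target not in visited and target not in frontier:
                frontier.add(target)
    return visited


graph = {"seed": ["a", "b"], "a": ["c", "d"], "b": ["d"], "c": [], "d": []}
print(crawl_order(graph, "seed"))
```

In this graph page `d` has two backlinks by the time it competes with `c`, so it is fetched first, whereas a breadth-first crawl would visit `c` before `d`.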
Conclusion from the final part of the paper

In general our results show that PageRank, IR’(P), is an excellent ordering metric when either pages with many backlinks or with high PageRank are sought. In addition, if the similarity to a driving query is important, then it is also useful to visit earlier URLs that:

Have anchor text that is similar to the driving query;

Have some of the query terms within the URL itself; or

Have a short link distance to a page that is known to be hot.

With a good ordering strategy, it seems to be possible to build crawlers that can rather quickly obtain a significant portion of the hot pages. This can be extremely useful when we are trying to crawl large portions of the Web, when our resources are limited, or when we need to revisit pages often to detect changes.

Cho2001, Crawling the Web: Discovery and Maintenance of Large-Scale Web Data

Cho, Junghoo, "Crawling the Web: Discovery and Maintenance of Large-Scale Web Data"
- http://oak.cs.ucla.edu/~cho/papers/cho-thesis.pdf
As a doctoral thesis, it is fairly long. It covers several issues, such as document changes, revisit intervals, and how to collect and store data when the store is already full.

Robots Exclusion Standard

http://en.wikipedia.org/wiki/Robots_Exclusion_Standard
http://en.wikipedia.org/wiki/Sitemaps
Definition of a web robot
- How does it differ from an offline browser?
- The difference between crawling and exposure in search results
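As a small illustration of the Robots Exclusion Standard linked above, Python's standard `urllib.robotparser` can evaluate robots.txt rules without any network access. The robots.txt content, the user-agent name `examplebot`, and the URLs are made up for the sketch.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() takes the file's lines, so nothing is fetched over the network.
parser.parse(ROBOTS_TXT.splitlines())

allowed = parser.can_fetch("examplebot", "http://www.example.com/index.html")
blocked = parser.can_fetch("examplebot", "http://www.example.com/private/data.html")
print(allowed, blocked)
```

Note that robots.txt only governs fetching; whether a fetched page is shown in search results is a separate decision, which is the distinction the last bullet above raises.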

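As examined so far, the "link problem" arising through search engines is undoubtedly one of the important issues in copyright matters. The use of thumbnail images in particular has been contested in Korea as well. On thumbnail images, however, the Supreme Court of Korea, in its thumbnail ruling 10), reasoned that

the photographic works of the non-indicted person that were posted on the search site in the form of thumbnail images had already been made public on that person's personal website;

as to the main purpose for which the defendant company provided the thumbnail images, its commercial character was merely indirect and secondary;

the thumbnail images do not have the aesthetic and artistic purpose of ordinary photographic works, and can hardly be regarded as using the photographs in their essential aspect;

posting the thumbnail images can hardly be regarded as substituting for demand for the non-indicted person's photographic works or as increasing the likelihood of infringing that person's copyright in them;

users of image search are also likely to perceive thumbnail images as a path to the site related to the image rather than to appreciate them as photographic works, and the use of thumbnail images has a strong public-interest aspect of providing more complete information to users of the search site;

and on these grounds held that the use of the thumbnails was within a legitimate scope and consistent with fair practice.

10) Supreme Court of Korea, Decision 2005Do7793, rendered February 9, 2006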

Domains

Hangul (Korean-language) domains: IDN
- http://en.wikipedia.org/wiki/Internationalized_domain_name
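A quick way to see what an IDN looks like on the wire is Python's built-in `idna` codec, which converts each non-ASCII label to its ASCII-compatible `xn--` form. This is a minimal sketch; the codec implements the older IDNA 2003 rules, and the domain below is only a sample label, not a reference to a real site.

```python
domain = "한글.kr"  # sample Hangul label, chosen only for illustration

# Unicode form -> ASCII-compatible encoding used in DNS.
ascii_form = domain.encode("idna")

# And back again.
round_trip = ascii_form.decode("idna")

print(ascii_form)
print(round_trip)
```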

Coding Style

Omitted

Documentation

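Documentation for programs and source code should 1) preferably be embedded in the code file itself, and 2) where that is impractical, be written as a separate wiki page.
Write documentation to the standard of a typical open-source program's documentation.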
Python - pydoc
Perl - perlpod, the Plain Old Documentation format
- ex1 - http://search.cpan.org/~codechild/XML-Bare/Bare.pm
- ex2 - http://search.cpan.org/~mattp/Test-WWW-Selenium/lib/WWW/Selenium.pm
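For the Python side of the guideline above, embedding documentation in the code file means writing docstrings, which `pydoc` can then render. The sketch below uses a made-up function, `crawl_delay`, purely to show the mechanism.

```python
import pydoc


def crawl_delay(host: str, default: float = 1.0) -> float:
    """Return the delay in seconds to wait between requests to `host`.

    This hypothetical helper always returns `default`; it exists only to
    show documentation embedded in the code file itself.
    """
    return default


# Render the same text that `pydoc` would show, without terminal bold codes.
doc_text = pydoc.render_doc(crawl_delay, renderer=pydoc.plaintext)
print(doc_text)
```

Running `python -m pydoc <module>` on a file written this way gives the module-level view, which plays the same role as POD does for the Perl examples listed above.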

Web Rendering

http://en.wikipedia.org/wiki/Web_browser_engine
http://phantomjs.org/
- PhantomJS is a headless WebKit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG.
Selenium http://docs.seleniumhq.org/
- Selenium automates browsers. That's it. What you do with that power is entirely up to you. Primarily it is for automating web applications for testing purposes, but is certainly not limited to just that. Boring web-based administration tasks can (and should!) also be automated as well.
Python selenium binding http://selenium.googlecode.com/git/docs/api/py/index.html
Perl Selenium Binding http://search.cpan.org/~mattp/Test-WWW-Selenium-1.36/lib/WWW/Selenium.pm
Introductory material on the WebKit Perl binding, an HTML presentation http://potyl.github.io/Talk-WebKit-Perl/ - use the arrow keys to move between slides.

Ideas

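From some initial research, phantomjs looks simple and convenient to use. It makes running arbitrary JavaScript code easy, and it has the added advantage that bindings exist for several languages such as Python and Perl. On Ubuntu version ??, phantomjs 1.4.0+dfsg-1 is available as a binary package. --김정겸, 2013-08-22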
