Bookmarks - aragorn/home GitHub Wiki

Natural Language Processing

Machine Learning

Unsorted 1

Unsorted 2

databases

  • The time series database service
    • https://tempo-db.com/
    • TempoDB is purpose-built to store & analyze time series data from sensors, smart meters, servers & more

In-House Search

Unix & Programming General

Data Visualization

hash library

Benchmarks and etc

SMHasher & MurmurHash v3

MurmurHash v2

https://sites.google.com/site/murmurhash/

  • Extremely simple - compiles down to ~52 instructions on x86.
  • Excellent distribution - passes chi-squared tests for practically all keysets & bucket sizes.
  • Excellent avalanche behavior - maximum bias is under 0.5%.
  • Excellent collision resistance - passes Bob Jenkins' frog.c torture-test. No collisions possible for 4-byte keys, no small (1- to 7-bit) differentials.
  • Excellent performance - measured on an Intel Core 2 Duo @ 2.4 GHz:
    • OneAtATime - 354.163715 MB/sec
    • FNV - 443.668038 MB/sec
    • SuperFastHash - 985.335173 MB/sec
    • lookup3 - 988.080652 MB/sec
    • MurmurHash 1.0 - 1363.293480 MB/sec
    • MurmurHash 2.0 - 2056.885653 MB/sec

jQuery

Mathematics

Tips

Benchmarks

HTTP

Web Development

SEO - Search Engine Optimization

Traditionally, hash fragments (that is, everything after # in the URL) have been used to indicate one portion of a static HTML document. By contrast, AJAX applications often use hash fragments in another function, namely to indicate state. For example, when a user navigates to the URL http://www.example.com/ajax.html#key1=value1&key2=value2, the AJAX application will parse the hash fragment and move the application to the "key1=value1&key2=value2" state. This is similar in spirit to moving to a portion of a static document, that is, the traditional use of hash fragments. History (the back button) in AJAX applications is generally handled with these hash fragments as well. Why are hash fragments used in this way? While the same effect could often be achieved with query parameters (for example, ?key1=value1&key2=value2), hash fragments have the advantage that in and of themselves, they do not incur an HTTP request and thus no round-trip from the browser to the server and back. In other words, when navigating from www.example.com/ajax.html to www.example.com/ajax.html#key1=value1&key2=value2, the web application moves to the state key1=value1&key2=value2 without a full page refresh. As such, hash fragments are an important tool in making AJAX applications fast and responsive. Importantly, however, hash fragments are not part of HTTP requests (and as a result they are not sent to the server), which is why our approach must handle them in a new way. See RFC 3986 for more details on hash fragments.
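To make the point about hash fragments concrete, here is a minimal sketch in Python that separates the part of a URL that travels in the HTTP request from the part that stays in the browser. The function name `split_request_and_state` is hypothetical, and treating the fragment as `key=value` pairs is an assumption taken from the example URL in the passage above.

```python
from urllib.parse import urlsplit, parse_qs


def split_request_and_state(url):
    """Return (request target seen by the server, client-side state dict)."""
    parts = urlsplit(url)
    # Only the path and the query string are sent in the HTTP request.
    request_target = parts.path or "/"
    if parts.query:
        request_target += "?" + parts.query
    # The fragment never leaves the browser; decode it as key=value pairs.
    state = {key: values[0] for key, values in parse_qs(parts.fragment).items()}
    return request_target, state


target, state = split_request_and_state(
    "http://www.example.com/ajax.html#key1=value1&key2=value2"
)
print(target)  # /ajax.html
print(state)   # {'key1': 'value1', 'key2': 'value2'}
```

A crawler that only sees `/ajax.html` cannot tell the two states apart, which is why the fragment needs separate handling.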

contents syndication protocol

  • sitemaps - http://www.sitemaps.org/protocol.html
    When a site wants its content exposed to search, the sitemaps protocol lets it export the full list of its content. A site can create multiple sitemaps files, and it can also use the ping protocol to signal to search engines that its sitemaps have been updated. It seems better than the Naver content syndication protocol. --김정겸
  • naver syndication api - http://dev.naver.com/openapi/apis/function/syndication
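As a rough illustration of exporting a content list with the sitemaps protocol, the sketch below builds a small sitemap document with the Python standard library. The namespace URI and the `urlset`, `url`, `loc`, and `lastmod` elements come from the sitemaps.org protocol; the function name `build_sitemap` and the sample URLs are hypothetical. The ping step and splitting into multiple files are not shown.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from (loc, lastmod) pairs; lastmod may be None."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" in the URL text.
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap([
    ("http://www.example.com/", "2013-08-22"),
    ("http://www.example.com/list?page=1&sort=new", None),
])
print(sitemap_xml)
```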

Web search and crawler

Lectures

We will discuss the design of a Web search engine and the extraction of information off the Web. Topics include:

  • Web crawlers
  • Database design
  • Query language
  • Relevance ranking
  • Document similarity and clustering
  • The "invisible" Web
  • Specialized search engines
  • Evaluation
  • Natural language processing
  • The structure of the Web
  • Web content mining
  • Web usage mining
  • Business model: pricing advertising
  • Multi-media retrieval
  • Multilingual retrieval

Textbooks

Wikipedia

For static assignment, a hashing function can be used to transform URLs (or, even better, complete website names) into a number that corresponds to the index of the corresponding crawling process. As there are external links that will go from a Web site assigned to one crawling process to a website assigned to a different crawling process, some exchange of URLs must occur. To reduce the overhead due to the exchange of URLs between crawling processes, the exchange should be done in batch, several URLs at a time, and the most cited URLs in the collection should be known by all crawling processes before the crawl (e.g.: using data from a previous crawl) (Cho and Garcia-Molina, 2002).
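The static assignment described above can be sketched as follows. This is only an illustration under stated assumptions: the function names `assign_crawler` and `batch_by_owner` are hypothetical, SHA-256 is an arbitrary choice of stable hash, and the host name is used as the hashing key, following the "complete website names" suggestion. Python's built-in `hash()` is avoided because it is randomized per process for strings, so different crawling processes would disagree.

```python
import hashlib
from urllib.parse import urlsplit


def assign_crawler(url, num_crawlers):
    """Map a URL to a crawling-process index using a hash of its host name."""
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers


def batch_by_owner(urls, num_crawlers):
    """Group discovered URLs by owning process so they can be sent in batches."""
    batches = {}
    for url in urls:
        batches.setdefault(assign_crawler(url, num_crawlers), []).append(url)
    return batches


found = [
    "http://www.example.com/a",
    "http://WWW.EXAMPLE.COM/b?x=1",
    "http://news.example.org/today",
]
for owner, urls in sorted(batch_by_owner(found, 4).items()):
    print(owner, urls)
```

Because the key is the host, every page of one site lands on the same process, and only links that cross sites need to be exchanged.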

Cho98, Efficient Crawling Through URL Ordering

  • Cho, J.; Garcia-Molina, H.; Page, L. (1998-04). "Efficient Crawling Through URL Ordering"
  • The paper introduces three main ordering strategies.
    • Fix the topic and keywords in advance, then visit first the web pages with a high matching score for those keywords. Matches in anchor text in particular receive a high weight. In effect, this method computes the search-ranking score directly and fetches first the documents that would rank near the top, so it can partially substitute for collecting the search results of other search services. It can be used when a specific topic must be collected with priority.
    • Compute the backlink count, then visit first the pages with a high backlink count.
    • Compute PageRank, then visit first the pages with a high PageRank.
  • For the performance evaluation, the metric is how quickly hot pages with high PageRank are visited, compared with random crawling and breadth-first crawling. Because hot pages are defined as the documents with high PageRank, the evaluation in effect measures whether the crawler quickly fetches the very pages it chose to visit first. That seems like it should obviously turn out this way.
  • Conclusion from the final part of the paper:

In general our results show that PageRank, IR'(P), is an excellent ordering metric when either pages with many backlinks or with high PageRank are sought. In addition, if the similarity to a driving query is important, then it is also useful to visit earlier URLs that:

  • Have anchor text that is similar to the driving query;
  • Have some of the query terms within the URL itself; or
  • Have a short link distance to a page that is known to be hot.

With a good ordering strategy, it seems to be possible to build crawlers that can rather quickly obtain a significant portion of the hot pages. This can be extremely useful when we are trying to crawl large portions of the Web, when our resources are limited, or when we need to revisit pages often to detect changes.
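The backlink-count strategy from the notes above could look roughly like this. It is a sketch, not the paper's implementation: the function name `crawl_order`, the in-memory `links` dictionary standing in for page fetches, and the tie-break by URL string are all assumptions. Backlinks are counted only from pages already visited.

```python
import heapq


def crawl_order(seed, links, limit):
    """Visit up to `limit` pages, always taking the known URL with the
    most backlinks seen so far (ties broken by URL string)."""
    backlinks = {seed: 0}
    visited, seen = [], set()
    heap = [(0, seed)]  # entries are (-backlink_count, url)
    while heap and len(visited) < limit:
        neg_count, url = heapq.heappop(heap)
        if url in seen or -neg_count != backlinks[url]:
            continue  # already visited, or a stale entry with an old count
        seen.add(url)
        visited.append(url)
        for target in links.get(url, []):
            backlinks[target] = backlinks.get(target, 0) + 1
            if target not in seen:
                heapq.heappush(heap, (-backlinks[target], target))
    return visited


# Hypothetical link graph: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["d", "e"], "c": ["e"], "d": [], "e": []}
print(crawl_order("a", links, 5))  # ['a', 'b', 'c', 'e', 'd']
```

A breadth-first crawl of the same graph would reach `d` before `e`; here `e` moves ahead because two visited pages link to it.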

Cho2001, Crawling the Web: Discovery and Maintenance of a Large-Scale Web Data

  • Cho, Junghoo, "Crawling the Web: Discovery and Maintenance of a Large-Scale Web Data"
  • It is a doctoral dissertation, so the body is fairly long. It covers several issues, such as document changes, revisit intervals, and how to collect and store data when storage is already full.

Robots Exclusion Standard

  • http://en.wikipedia.org/wiki/Robots_Exclusion_Standard
  • http://en.wikipedia.org/wiki/Sitemaps
  • Definition of a web robot
    • How does it differ from an offline browser?
    • The difference between crawling and exposure in search results
  • Copyright issues
    • When content is exposed in search results, is the copyright holder's permission or consent required?
  • "Search Engines and the Copyright System", 2008, No. 34

As examined so far, the "link problem" arising through search engines is without doubt one of the important issues in copyright matters. The use of thumbnail images in particular has been an issue in Korea as well. Regarding thumbnail images, however, the Supreme Court of Korea, in its thumbnail ruling 10), cited the following points:

  1. the photographic works of the non-party photographer that were posted on the search site as thumbnail images had already been published on the photographer's personal homepage;
  2. as to the defendant company's main purpose in providing the thumbnail images, any commercial character was merely indirect and secondary;
  3. thumbnail images do not serve the aesthetic and artistic purpose of ordinary photographic works, and can hardly be regarded as a use of the photographs in their essential aspect;
  4. posting the thumbnail images can hardly be seen as substituting for demand for the photographer's works or as raising the likelihood of infringement of the copyright in those works;
  5. users of image search are likely to perceive thumbnail images as a gateway to the site related to the image rather than to appreciate them as photographic works, and the use of thumbnail images has a strong public-interest aspect in giving users of the search site more complete information;

and on these and other grounds held that the use of the thumbnails was a use within a legitimate scope and consistent with fair practice.

  1. Supreme Court of Korea, Decision 2005Do7793, rendered 2006-02-09

Domains

Coding Style

  • Omitted

Documentation

  • Documentation for programs and source code should 1) preferably be embedded in the code files themselves, and 2) where that is impractical, be written as a separate wiki page.
  • When writing documentation, aim for the level of documentation typical of open-source programs.
  • Python - pydoc
  • Perl - perlpod, the Plain Old Documentation format
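As a small illustration of the "embed the documentation in the code file" rule for Python, the sketch below puts the documentation in a docstring and renders it with the standard `pydoc` module. The function `normalize_host` is a made-up example, not code from any project listed here.

```python
import pydoc


def normalize_host(host):
    """Return the normalized host name.

    Lower-cases the host and strips trailing dots, so that
    "WWW.Example.COM." and "www.example.com" compare equal.
    """
    return host.lower().rstrip(".")


# Render the embedded documentation as plain text, the way pydoc presents it.
doc_text = pydoc.render_doc(normalize_host, renderer=pydoc.plaintext)
print(doc_text)
```

The same text is available from the command line with `python -m pydoc <module name>` once the function lives in an importable module.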

Web Rendering

Ideas

  • From some brief research, phantomjs looks simple and convenient to use. It makes it easy to run arbitrary JavaScript code, and it has the further advantage that bindings are available for several languages such as Python and Perl. On Ubuntu version ??, phantomjs 1.4.0+dfsg-1 is provided as a binary package. --김정겸, 2013-08-22