Bookmarks - aragorn/home GitHub Wiki
Natural Language Processing
- https://ratsgo.github.io/natural%20language%20processing/2017/03/09/rnnlstm/
- https://www.reddit.com/r/MachineLearning/comments/9nfqxz/r_bert_pretraining_of_deep_bidirectional/
Machine Learning
- MLE - https://en.wikipedia.org/wiki/Maximum_likelihood_estimation
- MAP - https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation
- Linear regression: Regularization
- Causal Impact - https://google.github.io/CausalImpact/CausalImpact.html
- https://paper.dropbox.com/doc/Machine-Learning-Deep-Learning-Study-Guides-1aNx7WJ7c3NWf71qnEPOI
- http://gorakgarak.tistory.com/category/%EB%A8%B8%EC%8B%A0%EB%9F%AC%EB%8B%9D/%20TensorFlow
- http://darkpgmr.tistory.com/category/%EA%B8%B0%EA%B3%84%ED%95%99%EC%8A%B5
- Korean-language machine learning overview materials at Laon People - http://laonple.blog.me/220463627091
Unsorted 1
- Transaction strategies: Understanding transaction pitfalls http://www.ibm.com/developerworks/library/j-ts1/
- Counting Queries per Request with Hibernate and Spring http://knes1.github.io/blog/2015/2015-07-08-counting-queries-per-request-with-hibernate-and-spring.html
- JVM Garbage Collection http://yckwon2nd.blogspot.kr/2014/04/garbage-collection.html?m=1
Unsorted 2
- https://code.google.com/p/conque/wiki/Usage
- http://dorey.github.io/JavaScript-Equality-Table/
Use three equals unless you fully understand the conversions that take place for two-equals.
- https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average
- Visualizing Git Concepts with D3 http://www.wei-wang.com/ExplainGitWithD3/
databases
- The time series database service
- https://tempo-db.com/
- TempoDB is purpose-built to store & analyze time series data from sensors, smart meters, servers & more
In-house search
- http://cp.news.search.daum.net/partner/guide_qna News CP partnership
- http://cp.news.search.daum.net/partner/guide_enterance
- http://syndi-guide.search.daum.net:3000/ Video partnership
Unix & Programming General
- http://www.linusakesson.net/programming/tty/
The relationships among TTY, process/job/session, signals, and so on
- http://gcc.godbolt.org/
A web page that compiles C/C++ code online and shows the resulting assembly
- Sorting Algorithm Animations http://www.sorting-algorithms.com/
- http://offbytwo.com/2011/06/26/things-you-didnt-know-about-xargs.html
- http://mark.stosberg.com/blog/2010/12/percent-encoding-uris-in-perl.html
Broadly, there are two approaches, one using CGI.pm and one using URI::Escape, and neither is perfect.
Data Visualization
- http://mbostock.github.io/protovis/
- http://www.colorzilla.com/gradient-editor/
- Why Should Engineers and Scientists Be Worried About Color?
http://www.research.ibm.com/people/l/lloydt/color/color.HTM
hash library
Benchmarks and etc
- Hash functions: An empirical comparison http://www.strchr.com/hash_functions
- Meiyan http://www.sanmayce.com/Fastest_Hash/
- Benchmarking CRC32 and PopCnt instructions http://www.strchr.com/crc32_popcnt
- xxhash256 http://extrememoderate.wordpress.com/2012/06/10/xxhash256-update/
- CRC16 - CRC64 test results on 18.2M dataset, w/program source http://www.backplane.com/matt/crc64.html
- Program & Test Run by Matt Dillon
- 18.2M message-id dataset supplied by Joe Greco
- This article only discusses how to write a fast CRC32 algorithm in C/C++. http://create.stephan-brumme.com/crc32/
- http://developers.blog.box.com/2011/10/12/crc32-checksums-the-good-the-bad-and-the-ugly/
SMHasher & MurmurHash v3
MurmurHash v2
https://sites.google.com/site/murmurhash/
Extremely simple - compiles down to ~52 instructions on x86.
Excellent distribution - Passes chi-squared tests for practically all keysets & bucket sizes.
Excellent avalanche behavior - Maximum bias is under 0.5%.
Excellent collision resistance - Passes Bob Jenkins' frog.c torture-test. No collisions possible for 4-byte keys, no small (1- to 7-bit) differentials.
Excellent performance - measured on an Intel Core 2 Duo @ 2.4 GHz:
- OneAtATime - 354.163715 mb/sec
- FNV - 443.668038 mb/sec
- SuperFastHash - 985.335173 mb/sec
- lookup3 - 988.080652 mb/sec
- MurmurHash 1.0 - 1363.293480 mb/sec
- MurmurHash 2.0 - 2056.885653 mb/sec
jQuery
- menu-aim is a jQuery plugin for dropdown menus that can differentiate between a user trying to hover over a dropdown item vs trying to navigate into a submenu's contents.
Mathematics
- Partially ordered set, poset https://en.wikipedia.org/wiki/Partially_ordered_set
- Using a partially ordered set, the creation date of a web document can be estimated.
- http://en.wikipedia.org/wiki/Learning_to_rank
Tips
- Chrome Record modes and Playback modes
- Record modes let you record every request Chrome makes. Playback mode serves requests out of that recorded cache just as if they were being loaded on the spot. It doesn't record where you click or what you open, just every request as it moves over the wire. http://dev.hubspot.com/blog/bulletproof-demos
- What should every programmer know about web development?
Benchmarks
- HTTP Client Performance - IO http://blogs.atlassian.com/2013/07/http-client-performance-io/
HTTP
- Wikipedia
- http://en.wikipedia.org/wiki/Data-driven_programming
- http://en.wikipedia.org/wiki/Modular_programming
- Length of URLs - http://www.boutell.com/newfaq/misc/urllength.html
Web Development
SEO - Search Engine Optimization
- Making AJAX Applications Crawlable
Traditionally, hash fragments (that is, everything after # in the URL) have been used to indicate one portion of a static HTML document. By contrast, AJAX applications often use hash fragments in another function, namely to indicate state. For example, when a user navigates to the URL http://www.example.com/ajax.html#key1=value1&key2=value2, the AJAX application will parse the hash fragment and move the application to the "key1=value1&key2=value2" state. This is similar in spirit to moving to a portion of a static document, that is, the traditional use of hash fragments. History (the back button) in AJAX applications is generally handled with these hash fragments as well.
Why are hash fragments used in this way? While the same effect could often be achieved with query parameters (for example, ?key1=value1&key2=value2), hash fragments have the advantage that in and of themselves, they do not incur an HTTP request and thus no round-trip from the browser to the server and back. In other words, when navigating from www.example.com/ajax.html to www.example.com/ajax.html#key1=value1&key2=value2, the web application moves to the state key1=value1&key2=value2 without a full page refresh. As such, hash fragments are an important tool in making AJAX applications fast and responsive. Importantly, however, hash fragments are not part of HTTP requests (and as a result they are not sent to the server), which is why our approach must handle them in a new way. See RFC 3986 for more details on hash fragments.
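A minimal Python sketch of the point above, that the fragment stays in the browser while the path and query go to the server. The helper name `request_target` is made up for illustration; only the standard-library `urllib.parse.urlsplit` is assumed.

```python
from urllib.parse import urlsplit


def request_target(url: str) -> str:
    """Return the path-and-query part that would appear on the HTTP request line.

    The fragment is deliberately left out: it is handled client-side only.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


fragment_url = "http://www.example.com/ajax.html#key1=value1&key2=value2"
query_url = "http://www.example.com/ajax.html?key1=value1&key2=value2"

print(urlsplit(fragment_url).fragment)  # state kept in the fragment
print(request_target(fragment_url))     # what the server sees for the fragment form
print(request_target(query_url))        # what the server sees for the query form
```

With the fragment form the server only ever sees `/ajax.html`, so changing the state does not by itself trigger a new request.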
contents syndication protocol
- sitemaps - http://www.sitemaps.org/protocol.html
If a site wants its content exposed in search, the sitemaps protocol lets it export the full list of its content. A site can create multiple sitemaps files, and it can also use the ping protocol to signal to search engines that its sitemaps have been updated. It seems better than the Naver content syndication protocol. --김정겸
- naver syndication api - http://dev.naver.com/openapi/apis/function/syndication
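As a rough illustration of the sitemaps protocol mentioned above, the following Python sketch builds a small sitemap document with the standard library. The function name `build_sitemap` and the sample URLs and dates are assumptions for the example; the `urlset`, `url`, `loc`, and `lastmod` elements and the namespace are the ones defined at sitemaps.org. The ping step is not shown.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a sitemap XML string.

    entries: iterable of (loc, lastmod) pairs; lastmod may be None.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical site content, for illustration only.
sitemap_xml = build_sitemap([
    ("http://www.example.com/", "2013-08-22"),
    ("http://www.example.com/about.html", None),
])
print(sitemap_xml)
```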
Web search and crawler
Lectures
We will discuss the design of a Web search engine and the extraction of information off the Web. Topics include: Web crawlers; database design; query language; relevance ranking; document similarity and clustering; the "invisible" Web; specialized search engines; evaluation; natural language processing; the structure of the Web; Web content mining; Web usage mining; business model: pricing advertising; multimedia retrieval; multilingual retrieval.
Textbooks
- Mining of Massive Datasets by Anand Rajaraman (@anand_raj) and Jeff Ullman
Wikipedia
For static assignment, a hashing function can be used to transform URLs (or, even better, complete website names) into a number that corresponds to the index of the corresponding crawling process. As there are external links that will go from a Web site assigned to one crawling process to a website assigned to a different crawling process, some exchange of URLs must occur. To reduce the overhead due to the exchange of URLs between crawling processes, the exchange should be done in batch, several URLs at a time, and the most cited URLs in the collection should be known by all crawling processes before the crawl (e.g.: using data from a previous crawl) (Cho and Garcia-Molina, 2002).
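The static assignment described above could be sketched in Python roughly as follows. This is an illustration under assumptions, not the scheme from the cited paper: the function names `assign_crawler` and `route_links`, the choice of SHA-256, and hashing the lowercased host name are all choices made for this example.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def assign_crawler(url: str, num_crawlers: int) -> int:
    """Map a URL to a crawler index by hashing its host name.

    Hashing the host (not the full URL) keeps one site on one crawler.
    """
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers


def route_links(links, my_index: int, num_crawlers: int):
    """Split discovered links into local ones and per-peer batches.

    Outbound links are grouped by owner so they can be sent several at a time.
    """
    local, outbound = [], defaultdict(list)
    for link in links:
        owner = assign_crawler(link, num_crawlers)
        if owner == my_index:
            local.append(link)
        else:
            outbound[owner].append(link)
    return local, dict(outbound)


links = [
    "http://www.example.com/a",
    "http://www.example.com/b?x=1",
    "http://www.example.org/",
    "http://www.example.net/page",
]
local, outbound = route_links(links, my_index=0, num_crawlers=4)
print(local, outbound)
```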
Cho98, Efficient Crawling Through URL Ordering
- Cho, J.; Garcia-Molina, H.; Page, L. (1998-04). "Efficient Crawling Through URL Ordering"
- It introduces three broad ordering strategies.
- Fix the topic and keywords in advance, then prioritize visiting pages whose content scores highly against those keywords. Matches in anchor text are weighted especially heavily. In effect, this method computes a search-ranking score directly and fetches first the documents that would rank near the top; it can partially replace collecting the search results of another search service. It is usable when a particular topic must be collected first.
- Compute the backlink count and prioritize pages with a high backlink count.
- Compute PageRank and prioritize pages with a high PageRank.
- For performance evaluation, the metric is how quickly the crawler visits hot pages with high PageRank, compared with random crawling and breadth-first crawling. Because hot pages are defined as the documents with high PageRank, the evaluation effectively measures whether the crawler quickly fetches the very pages it chose to prioritize. That seems like it should hold as a matter of course.
- Conclusion from the final part of the paper
In general our results show that PageRank, IR'(P), is an excellent ordering metric when either pages with many backlinks or with high PageRank are sought. In addition, if the similarity to a driving query is important, then it is also useful to visit earlier URLs that:
- Have anchor text that is similar to the driving query;
- Have some of the query terms within the URL itself; or
- Have a short link distance to a page that is known to be hot.
With a good ordering strategy, it seems to be possible to build crawlers that can rather quickly obtain a significant portion of the hot pages. This can be extremely useful when we are trying to crawl large portions of the Web, when our resources are limited, or when we need to revisit pages often to detect changes.
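To make the backlink-count ordering noted above concrete, here is a small Python sketch. It is a simplification under stated assumptions: the `graph` dictionary stands in for actually fetching pages, the function name `crawl_order` is hypothetical, and the backlink count is estimated only from pages visited so far.

```python
import heapq
from collections import defaultdict


def crawl_order(graph, seed):
    """Return the visit order when the frontier is ordered by backlink count.

    graph: dict mapping a URL to its outgoing links (a stand-in for fetching).
    """
    backlinks = defaultdict(int)
    visited, order = set(), []
    frontier = [(0, seed)]  # entries are (negated backlink count, url)
    while frontier:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue  # stale entry left over from an earlier, lower count
        visited.add(url)
        order.append(url)
        for target in graph.get(url, []):
            backlinks[target] += 1
            if target not in visited:
                # Re-push with the updated count instead of editing the heap.
                heapq.heappush(frontier, (-backlinks[target], target))
    return order


# Toy link graph, for illustration only.
graph = {
    "a": ["b", "c", "d"],
    "b": ["e"],
    "c": ["e"],
    "d": ["f"],
    "e": [],
    "f": [],
}
print(crawl_order(graph, "a"))
```

In this toy graph, page `e` is visited before `d` because two already-visited pages link to it, whereas a breadth-first crawl would reach `d` first.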
Cho2001, Crawling the Web: Discovery and Maintenance of a Large-Scale Web Data
- Cho, Junghoo, "Crawling the Web: Discovery and Maintenance of a Large-Scale Web Data"
- It is a doctoral dissertation, so the body is fairly long. It covers several issues, including document changes, revisit frequency, and how to collect and store data when storage is already full.
Robots Exclusion Standard
- http://en.wikipedia.org/wiki/Robots_Exclusion_Standard
- http://en.wikipedia.org/wiki/Sitemaps
- Definition of a web robot
- How does it differ from an offline browser?
- The difference between crawling and exposure in search results
- Copyright issues
- When content is exposed in search results, is the copyright holder's permission or consent required?
- "Search Engines and the Copyright System", 2008, No. 34
- Korea Copyright Commission https://www.copyright.or.kr/info/ground_view.do?bd_seq=6229&cPage=86
As discussed so far, the "link problem" arising through search engines is undoubtedly one of the important copyright-related issues. The use of thumbnail images in particular has also been contested in Korea. Regarding thumbnail images, however, the Supreme Court of Korea, in its thumbnail decision, noting that
- the photographic works of the third party (a non-party to the indictment) posted on the search site as thumbnail images had already been made public on that person's personal homepage,
- the commercial character of the defendant company's main purpose in providing the thumbnail images was merely indirect and secondary,
- the thumbnail images do not serve the aesthetic and artistic purpose of ordinary photographic works and can hardly be regarded as using the essential aspect of the photographs,
- posting the thumbnail images can hardly be regarded as substituting for demand for the third party's photographs or as raising the likelihood of infringement of the copyright in those works,
- users of image search are more likely to perceive thumbnail images as a path to the site related to the image than to appreciate them as photographic works, and the use of thumbnail images has a strong public-interest aspect in providing users of the search site with more complete information,
held that the use of the thumbnails was within a legitimate scope and consistent with fair practice.
- Supreme Court of Korea, Decision 2005Do7793, rendered 2006.2.9.
- http://googleblog.blogspot.kr/2007/02/robots-exclusion-protocol.html
- https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
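As a small illustration of how a crawler might honor the robots exclusion rules referenced above, the following Python sketch uses the standard-library `urllib.robotparser`. The sample robots.txt content, the user agent name `examplebot`, and the helper `allowed` are made up for the example; the rules are parsed from a string so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(ROBOTS_TXT, "examplebot", "http://www.example.com/public/index.html"))
print(allowed(ROBOTS_TXT, "examplebot", "http://www.example.com/private/page.html"))
```

Note that this only covers whether a page may be fetched; whether fetched content may be shown in search results is the separate question raised in the notes above.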
Domains
- Hangul (Korean) domain names: IDN
Coding Style
- Omitted
Documentation
- Documentation for programs and source code should 1) preferably be embedded in the code file itself and 2) where that is impractical, be written as a separate wiki page.
- Write documentation to the standard of typical open-source programs.
- Python - pydoc
- Perl - perlpod, the Plain Old Documentation format
Web Rendering
- http://en.wikipedia.org/wiki/Web_browser_engine
- http://phantomjs.org/
- PhantomJS is a headless WebKit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG.
- Selenium http://docs.seleniumhq.org/
- Selenium automates browsers. That's it. What you do with that power is entirely up to you. Primarily it is for automating web applications for testing purposes, but is certainly not limited to just that. Boring web-based administration tasks can (and should!) also be automated as well.
- Python selenium binding http://selenium.googlecode.com/git/docs/api/py/index.html
- Perl Selenium Binding http://search.cpan.org/~mattp/Test-WWW-Selenium-1.36/lib/WWW/Selenium.pm
- Introductory material on the WebKit Perl binding, an HTML presentation http://potyl.github.io/Talk-WebKit-Perl/ - use the arrow keys to move between slides.
Ideas
- From some preliminary research, phantomjs looks simple and convenient to use. It is easy to run arbitrary JavaScript code, and it has the advantage of bindings for several languages such as Python and Perl. On Ubuntu version ??, phantomjs 1.4.0+dfsg-1 is provided as a binary package. --김정겸, 2013-08-22