Machine Learning - sgml/signature GitHub Wiki
Data preprocessing involves gathering and cleaning data, which is essential for any machine learning task.
In new research accepted for publication in Chaos, Ott and his collaborators showed that improved predictions of chaotic systems like the Kuramoto-Sivashinsky equation become possible by hybridizing the data-driven, machine-learning approach and traditional model-based prediction. Ott sees this as a more likely avenue for improving weather prediction and similar efforts, since we don’t always have complete high-resolution data or perfect physical models. “What we should do is use the good knowledge that we have where we have it,” he said, “and if we have ignorance we should use the machine learning to fill in the gaps where the ignorance resides.”
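The hybrid idea described above can be illustrated with a toy sketch. It is not the method used in the research, only an assumption-laden miniature: the "real" system is a logistic map with `TRUE_R = 3.9`, the imperfect knowledge-based model uses `MODEL_R = 3.7`, and the "machine learning" part is a one-parameter least-squares fit of the model's residual. All names and values are hypothetical.

```python
def logistic_step(x, r):
    """One step of the logistic map x -> r * x * (1 - x)."""
    return r * x * (1.0 - x)

TRUE_R = 3.9   # assumed "real" system, used only to generate observations
MODEL_R = 3.7  # assumed imperfect knowledge-based model

def simulate(x0, r, steps):
    """Iterate the map and return the whole trajectory."""
    series = [x0]
    for _ in range(steps):
        series.append(logistic_step(series[-1], r))
    return series

observed = simulate(0.2, TRUE_R, 200)
current, following = observed[:-1], observed[1:]

# Data-driven part: fit a correction to whatever the imperfect model gets wrong.
features = [x * (1.0 - x) for x in current]
residuals = [nxt - logistic_step(x, MODEL_R) for x, nxt in zip(current, following)]
coef = sum(f * e for f, e in zip(features, residuals)) / sum(f * f for f in features)

def model_only_step(x):
    """Prediction from the imperfect model alone."""
    return logistic_step(x, MODEL_R)

def hybrid_step(x):
    """Imperfect model plus the learned correction."""
    return logistic_step(x, MODEL_R) + coef * x * (1.0 - x)

def mean_abs_error(step_fn):
    """Mean one-step-ahead error over the observed series."""
    errors = [abs(step_fn(x) - nxt) for x, nxt in zip(current, following)]
    return sum(errors) / len(errors)

print(f"model only: {mean_abs_error(model_only_step):.4f}")
print(f"hybrid:     {mean_abs_error(hybrid_step):.4f}")
```

The errors are measured on the same series used for fitting, so this shows only that the correction fills the gap left by the model. It says nothing about long-range forecasts of a chaotic system.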
AI is skilled at games. Machine learning is skilled at statistics. "Deep learning is also highly susceptible to bias. When Google's facial recognition system was initially rolled out, for instance, it tagged many black faces as gorillas."
Child Psychology
- https://www.hackster.io/Fryden-Learning/ai-and-machine-learning-for-kids-2baa1f
- https://machinelearningforkids.co.uk/#!/worksheets
- https://www.commonsense.org/education/top-picks/best-coding-tools-for-middle-school
- https://www.stevemurch.com/machine-learning-ai-for-kids-resources/2018/12
- https://blog.ozobot.com/parenting/parents-can-explain-artificial-intelligence-machine-learning-kids/
- https://www.samplereality.com/2015/09/05/your-mistake-was-a-vital-strategy/
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6287292/
- https://www.pnas.org/content/105/13/5012
- https://scienceblogs.com/cognitivedaily/2009/04/29/do-we-reason-with-statistics-i
- https://datascience.aero/how-much-data-you-need/
- http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.878.1367&rep=rep1&type=pdf
- http://www.cogsci.ucsd.edu/~deak/cdlab/Publication/LietalTR2009.pdf
Storytelling
I think stories are what make us different from chimpanzees and Neanderthals. And if story-understanding is really where it’s at, we can’t understand our intelligence until we understand that aspect of it.
- http://nautil.us/issue/75/story/the-storytelling-computer
- http://people.csail.mit.edu/phw/index.html
- https://www.memoriesofpatrickwinston.com/
- https://phys.org/news/2017-12-arrow-relative-concept-absolute.html
Obscurity
writers_editors:
  - name: "Walter Bagehot (England)"
    url: "https://www.economicshelp.org/blog/26107/economics/walter-bagehot/"
  - name: "Benjamin Constant (France)"
    url: "https://plato.stanford.edu/entries/constant/"
  - name: "Giuseppe Mazzini (Italy)"
    url: "https://spartacus-educational.com/ITmazzini.htm"
  - name: "John Paul II (Poland)"
    url: "https://www.jstor.org/stable/10.5325/jjohnpajstud.4.2.0001"
  - name: "José Martí (Cuba)"
    url: "https://www.jstor.org/stable/30209112"
Scaffolding
- https://www.forbes.com/sites/janakirammsv/2018/01/01/why-do-developers-find-it-hard-to-learn-machine-learning/#4237864f6bf6
- https://marutitech.com/problems-solved-machine-learning/
- https://www.linkedin.com/pulse/cat-teaching-computers-how-learn-mike-volpi/
- https://machinelearningmastery.com/youre-wrong-machine-learning-not-hard/
- https://www.robinwieruch.de/machine-learning-javascript-web-developers/
- https://law.vanderbilt.edu/files/archive/Judicial_Intuition.pdf
- http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
- https://www.gamasutra.com/view/feature/130578/visual_finite_state_machine_ai_.php
- https://www.codeproject.com/Articles/1275031/Why-Real-Neurons-Learn-Faster
- https://www.quora.com/Is-there-a-way-to-use-Machine-Learning-to-predict-the-outcome-of-a-coin-toss
- https://medium.com/@davidllorente/nlg-technologies-artificial-intelligence-vs-rule-base-approach-cf8e9992461e
- https://medium.com/@davidllorente/automatic-natural-language-generation-the-new-normal-cd36ed8976de
- https://blog.wolfram.com/2010/11/30/how-to-win-at-coin-flipping/
- https://medium.com/@mikeharrisNY/the-heist-is-the-coin-toss-77ee4d870037
- https://docs.aws.amazon.com/silk/latest/developerguide/machine-learning.html
Concepts
- https://openstax.org/details/books/principles-data-science
- https://developers.google.com/machine-learning/glossary/generative
- http://people.csail.mit.edu/phw/mit.html
- https://web.archive.org/web/20240923014939/https://media.licdn.com/dms/image/v2/D4E22AQGaPFklcD4qHg/feedshare-shrink_800/feedshare-shrink_800/0/1725535189735?e=1729728000&v=beta&t=oYF0O87cI6zVZS9F-rAF5A9V-nI13ScqS5IPsz2oA1o
- https://ml-cheatsheet.readthedocs.io/en/latest/nn_concepts.html
- https://www.sqlservercentral.com/articles/machine-learning-101-the-mathematics-of-an-artificial-neural-network-6
- http://inverseprobability.com/talks/notes/probabilistic-machine-learning.html
- https://www.frontiersin.org/articles/10.3389/fpsyg.2015.01120/full
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2722922/
- https://www.hindawi.com/journals/complexity/2019/2952304/
- http://www.alice.id.tue.nl/references/kahnemann-2003.pdf
- https://stackoverflow.com/questions/616292/is-it-possible-for-a-computer-to-learn-a-regular-expression-by-user-provided-e
- https://ruccs.rutgers.edu/images/personal-rochel-gelman/publications/Obrecht_Chapman_Gelman_2007_Intuitive_t_Tests_Lay_use_of_statistical_information.pdf
- https://ir.lib.uwo.ca/cgi/viewcontent.cgi?referer=&httpsredir=1&article=1230&context=etd
- https://en.wikipedia.org/wiki/Sortition
- https://www.nature.com/articles/ncomms15694
- https://en.wikipedia.org/wiki/Logic_learning_machine
- https://en.wikipedia.org/wiki/Hierarchical_classifier
- https://en.wikipedia.org/wiki/Quasi-likelihood
- https://www.microsoft.com/en-us/research/wp-content/uploads/2017/02/unknown_unknowns_identify_algo.pdf
- https://www.quora.com/Are-artificial-intelligence-compiler-theory-and-Automata-related-together-If-so-then-how
Research
- https://elitedatascience.com/machine-learning-algorithms
- https://www.riskiq.com/blog/external-threat-management/machine-learning-silver-bullet/
- https://medium.com/data-from-the-trenches/marketing-attribution-tutorial-part-2-7bb78dec502
- https://www.ibm.com/developerworks/community/blogs/jfp/entry/Feature_Engineering_For_Deep_Learning?lang=en
- https://www.ibm.com/developerworks/community/blogs/jfp/entry/Machine_Learning_As_Prescriptive_Analytics?lang=en
- https://arxiv.org/pdf/1907.06094.pdf
- https://www.nasdaq.com/articles/5-ways-companies-are-transforming-their-businesses-machine-learning-2019-03-13
- https://www.codeproject.com/Articles/4414/A-Proposed-Model-for-Simulating-Human-Artificial-I
- https://sports.stackexchange.com/questions/749/do-basketball-players-tend-to-improve-at-shooting-free-throws-over-the-course-of
- https://sports.stackexchange.com/questions/12074/how-to-collect-stats-in-nba
- https://sports.stackexchange.com/questions/20922/is-there-a-standard-advanced-stat-in-basketball-for-measuring-player-consist
- https://prosportsanalytics.com/2017/05/25/predicting-nba-salaries-part-1/
- http://horror.dreamdawn.com/?p=49113
- https://www.gamasutra.com/blogs/PaulTozour/20141216/232023/The_Game_Outcomes_Project_Part_1_The_Best_and_the_Rest.php
- https://ai.stackexchange.com/questions/5577/is-it-possible-to-write-an-adaptive-parser
- https://conf.slac.stanford.edu/xldb2019/sites/xldb2019.conf.slac.stanford.edu/files/Wed_10.55_Seyed_Umar_BigQueryML-XLDB2019.pdf
- https://towardsdatascience.com/deep-neural-network-implemented-in-pure-sql-over-bigquery-f3ed245814d3
SQL Algorithms
- https://thenewstack.io/sql-fans-can-now-develop-ml-applications/
- https://www.snowflake.com/blog/synthetic-data-generation-at-scale-part-2/
- https://towardsdatascience.com/machine-learning-with-sql-ae46b1fe78a9?gi=bd255df749bd
- https://towardsdatascience.com/learning-sql-201-optimizing-queries-regardless-of-platform-918a3af9c8b1
Data Quality
- http://www.oecd.org/statistics/statisticalresources.htm
- https://apiumhub.com/tech-blog-barcelona/introduction-perceptual-hashes-measuring-similarity/
- https://www.openprisetech.com/blog/i-predict-your-predictive-scoring-project-will-fail-heres-why-and-how-to-do-it-right/
- http://trap.ncirl.ie/3437/1/marinalambert.pdf
- https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/data-mining-using-machine-learning-to-rediscover-customers-paper.pdf
- https://docs.oracle.com/cd/E14004_01/books/PDF/MKTG_User.pdf
- https://steelkiwi.com/blog/what-is-machine-learning/
- https://www.altexsoft.com/blog/datascience/preparing-your-dat
- https://sports.stackexchange.com/questions/20922/is-there-a-standard-advanced-stat-in-basketball-for-measuring-player-consist
- https://sports.stackexchange.com/questions/12074/how-to-collect-stats-in-nba
Techniques
- https://rosettacode.org/wiki/Rock-paper-scissors
- https://stripe.com/blog/fraud-reporting
- https://www.analyticsvidhya.com/glossary-of-common-statistics-and-machine-learning-terms/
- https://www.linkedin.com/pulse/predicting-buying-behavior-through-machine-learning-case-mitra/
- https://www.altexsoft.com/blog/datascience/preparing-your-dataset-for-machine-learning-8-basic-techniques-that-make-your-data-better/
- https://www.linkedin.com/pulse/naive-bayes-classifier-foundation-machine-learning-chase-perkins
- https://www.reddit.com/r/askscience/comments/3zghfk/mathematics_probability_question_do_we_treat_coin/
- https://statweb.stanford.edu/~susan/papers/headswithJ.pdf
- https://www.kaggle.com/yusukesaito0141/bigqueryml-is-all-you-need
- https://codegolf.stackexchange.com/questions/11880/build-a-working-game-of-tetris-in-conways-game-of-life
- https://gamedev.stackexchange.com/questions/55151/rpg-logarithmic-leveling-formula
- https://www.forbes.com/sites/audreymurrell/2019/05/30/big-data-and-the-problem-of-bias-in-higher-education/
- https://2021.ai/fairness-in-machine-learning/
- https://newsroom.haas.berkeley.edu/minority-homebuyers-face-widespread-statistical-lending-discrimination-study-finds/
- http://machineintelligenceafrica.org/about/
- https://www.technologyreview.com/s/613848/ai-africa-machine-learning-ibm-google/
- https://medium.com/pandorabots-blog/aiml-tutorial-the-srai-tag-5bb1f9d08169
OCR
Decision Trees
- https://www.w3schools.com/python/python_ml_decision_tree.asp
- https://www.slideshare.net/ShivangiGupta54/tree-pruning-56173803
- https://machinelearningmastery.com/implement-decision-tree-algorithm-scratch-python/
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4980076/
- https://medium.com/greyatom/decision-trees-a-simple-way-to-visualize-a-decision-dc506a403aeb
- https://www.nltk.org/book/ch06.html
- https://sefiks.com/2018/10/27/how-pruning-works-in-decision-trees/
Relational Data Interoperability
- http://www.kde.cs.tsukuba.ac.jp/~masa/papers/thesis.pdf
- http://cmj4.web.rice.edu/mat_vec.pdf
- https://core.ac.uk/download/pdf/34329967.pdf
- https://towardsdatascience.com/set-theory-basic-notation-da93c3d48090
Full Text Faceted Search Engine Marketing
- https://lucene.apache.org/solr/guide/7_5/machine-learning.html
- https://findwise.com/blog/improve-search-relevance-using-machine-learning-statistics-apache-solr-learning-rank/
- https://www.hillstonenet.com/blog/a-hybrid-approach-to-detect-malicious-web-crawlers/
- http://www.cs.stir.ac.uk/~kms/schools/rps/index.php
- https://www.aclweb.org/anthology/D07-1086
- https://searchengineland.com/google-extends-same-meaning-close-variants-to-phrase-match-broad-match-modifiers-320138
- https://www.datacamp.com/community/tutorials/sem-data-science
- https://moz.com/blog/google-vs-bing-correlation-analysis-of-ranking-elements
- https://itnext.io/apache-solr-because-your-database-is-not-a-search-engine-57705352df8a
- https://content.iospress.com/articles/data-science/ds007
- https://spidermonkey.ca/robot.shtml
- https://docs.oracle.com/cd/E05317_01/psft/acrobat/dm0682.pdf
NLP
- https://luckytoilet.wordpress.com/2018/01/01/real-world-applications-of-automaton-theory/
- https://en.wikipedia.org/wiki/Bag-of-words_model
- https://www.npmjs.com/package/search-index
- https://davidwalsh.name/open-search
- http://2ality.com/2013/06/chrome-omnibox-search.html
- https://docs.servicenow.com/bundle/london-platform-administration/page/integrate/inbound-other-web-services/task/t_BuildSearchProviderForInstance.html
- https://towardsdatascience.com/predicting-logics-lyrics-with-machine-learning-9e42aff63730
- https://www.newscientist.com/article/2115684-machine-learning-lets-computer-create-melodies-to-fit-any-lyrics/
- https://medium.com/@yashka.troy/alphabet-array-solving-anagrams-fc5e1ac68431
- https://www.scirp.org/journal/PaperInformation.aspx?PaperID=20943
- https://scholarsarchive.byu.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=3813&context=etd
- http://www.lwebzem.com/cgi-bin/courses/course_view.cgi?c=naive_bayes_classification_course.cgi
Music
NLG
- title:machine title:learning title:lyrics -ext:pdf
- https://github.com/llSourcell/Rap_Lyric_Generator/blob/master/MarkovRap.py
- https://www.cjr.org/tow_center_reports/guide_to_automated_journalism.php/
- https://www.salesforce.com/blog/2016/09/artificial-intelligence-helps-small-businesses.html
- https://pdfs.semanticscholar.org/08ed/cd794d534450f46ba5969f3e4098a0b4c744.pdf
- https://www.import.io/post/neural-nets-how-regular-expressions-brought-about-deep-learning/
- http://www.agence-nationale-recherche.fr/Project-ANR-14-CE24-0033
Gender Prediction
- https://pypi.org/project/gender-guesser/
- https://github.com/appeler/ethnicolr
- https://www.namsor.com/
- https://stephenholiday.com/articles/2011/gender-prediction-with-python/
- https://www.kdnuggets.com/2015/11/machine-learning-predict-gender.html
- https://github.com/alecglassford/compciv-2016/blob/master/projects/gender-detector-data/README.md
Codegen
- https://www.functionize.com/blog/robot-framework-a-closer-look-at-keyword-driven-testing-approach/
- https://www.codeproject.com/Articles/1156694/A-Look-into-the-Future-Source-Code-Generation-by-t
- https://www.red3d.com/cwr/steer/
Datasets
- https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research
- https://archive.ics.uci.edu/ml/datasets/letter+recognition
- https://medium.com/@saframpton/our-lossy-alphabet-a59516e8c3fb
- https://github.com/angular/code.angularjs.org/blob/master/1.4.9/docs/js/search-data.json
Packages
- https://www.predictiveanalyticstoday.com/top-artificial-neural-network-software/
- https://www.kaggle.com/vimota/getting-started-with-bigquery-ml-in-r
Classifiers
- https://blog.feedly.com/data-science-behind-recommendations-in-feedly/
- https://pdfs.semanticscholar.org/ccba/c9c9cda72b27bfda0d780be86da744b6ce7c.pdf
- https://www.supermarketguru.com/articles/the-first-of-its-kind-study-shows-ai-tool-can-improve-best-practices-in-managing-nut-allergies/
- https://www.businessinsider.com/how-to-say-hello-around-the-world-2015-8
- https://growth.wingify.com/what-you-need-to-know-before-you-board-the-machine-learning-train-a81c513098fe
- https://blog.fastforwardlabs.com/2017/03/09/fairml-auditing-black-box-predictive-models.html
- http://inverseprobability.com/talks/notes/the-three-ds-of-machine-learning.html
- https://www.scienceabc.com/innovation/how-did-the-nintendo-game-duck-hunt-work.html
- https://venturebeat.com/2019/01/18/machine-learning-is-rescuing-old-game-textures-in-zelda-and-final-fantasy/
- https://www.quora.com/Which-machine-learning-algorithms-do-Super-Smash-Bros-Wii-U-Amiibos-utilize
- https://towardsdatascience.com/how-to-win-over-70-matches-in-rock-paper-scissors-3e17e67e0dab
Overkill / Over Engineering
options:
  - name: "GenAI"
    suited_for: "Large-scale creative generation tasks"
    cpu_cost: "High"
    gpu_cost: "High"
    cloud_IDE_time: "High"
    money_efficiency: "Low"
    comment: "Overkill for a CSV dataset of 500 rows."
  - name: "Discriminative AI"
    suited_for: "Large-scale classification or predictive tasks"
    cpu_cost: "Medium to High"
    gpu_cost: "High"
    cloud_IDE_time: "High"
    money_efficiency: "Low"
    comment: "Not cost-effective for only 500 rows."
  - name: "Machine Learning"
    suited_for: "Statistical modeling and predictions on moderate to large datasets"
    cpu_cost: "Moderate to High"
    gpu_cost: "Potentially high if complex models are used"
    cloud_IDE_time: "High (due to setup and iterative tuning)"
    money_efficiency: "Low to Moderate"
    comment: "The added overhead does not justify its use for such a small dataset."
  - name: "MechanicalTurk"
    suited_for: "Human-powered tasks (e.g., annotation) rather than computation"
    cpu_cost: "N/A"
    gpu_cost: "N/A"
    cloud_IDE_time: "Not applicable"
    money_efficiency: "Low (costs for human labor are high)"
    comment: "Not applicable for computational analysis of the dataset."
  - name: "Pandas"
    suited_for: "Data manipulation and analysis on small to moderate datasets"
    cpu_cost: "Very low"
    gpu_cost: "None required"
    cloud_IDE_time: "Minimal (runs locally on a standard CPU)"
    money_efficiency: "High (free and open source)"
    comment: "Ideal choice for a 500-row CSV dataset."
  - name: "Spreadsheet Macro"
    suited_for: "Simple data tasks in tools like Excel or Google Sheets"
    cpu_cost: "Very low"
    gpu_cost: "Not applicable"
    cloud_IDE_time: "Minimal (often built into desktop or web apps)"
    money_efficiency: "High (if software/subscriptions are already in use)"
    comment: "Works well for basic tasks but lacks the flexibility of Pandas for more in-depth analysis."
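To make the point about small datasets concrete, here is a minimal sketch that summarizes a 500-row CSV with nothing but the Python standard library. The column names `group` and `value` and the generated data are hypothetical. In pandas the same summary would be a `read_csv` call followed by a `groupby`.

```python
import csv
import io
import statistics
from collections import defaultdict

def make_csv(n_rows=500):
    """Build a hypothetical CSV in memory: alternating groups, value = row index."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["group", "value"])
    for i in range(n_rows):
        writer.writerow(["a" if i % 2 == 0 else "b", i])
    return buffer.getvalue()

def mean_by_group(csv_text):
    """Return the mean of `value` for each `group`."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["group"]].append(float(row["value"]))
    return {name: statistics.mean(values) for name, values in groups.items()}

summary = mean_by_group(make_csv())
print(summary)
```

A job of this size finishes almost instantly on a laptop CPU, which is why the heavier options above are listed as overkill.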
Deepfake Detection
- https://regtechafrica.com/thaless-friendly-hackers-unit-invents-metamodel-to-detect-ai-generated-deepfake-images/
- https://www.ycombinator.com/companies/reality-defender
Fantasy Basketball
- https://basketball.fantasysports.yahoo.com/nba/167497/
- https://developer.yahoo.com/fantasysports/guide/
Fantasy Football
- https://archive.fantasysports.yahoo.com/nfl/2016/560404/
- https://www.nfl.com/playerhealthandsafety/resources/press-releases/top-finishers-in-nfl-data-challenge-improve-league-s-ability-to-predict-injuries
Dictionaries
Chatbots
- https://chatbotslife.com/text-classification-using-algorithms-e4d50dcba45
- https://www.codeproject.com/Articles/12454/Developing-AI-chatbots
- https://www.tutorialspoint.com/aiml/
- https://blog.publicinput.com/news/blog/introducing-kevin-kamto-and-machine-learning-92c8cd0d67cf
- https://chatbotsmagazine.com/what-is-the-working-of-a-chatbot-e99e6996f51c
- https://www.marutitech.com/why-can-chatbots-replace-mobile-apps-immediately/
- https://www.gamasutra.com/view/feature/132155/beyond_aiml_chatbots_102.php
- https://easydita.com/a-chatbot-maturity-model/
- https://www.sam-solutions.com/blog/java-is-it-the-best-language-for-artificial-intelligence/
- https://chatterbot.readthedocs.io/en/stable/tutorial.html
Slackbots
- https://medium.com/slack-developer-blog/conversing-with-ai-on-slack-5af2561f98a5
- https://medium.com/@SAPCAI/a-natural-language-slackbot-19ca5b0fc64b
- https://uxplanet.org/design-lessons-from-building-an-ai-slackbot-for-uncle-brian-73f5a9b1fe89
- https://dzone.com/articles/build-a-scheduler-slackbot-in-30-minutes
Finite State Machine
- https://borjaballe.github.io/other/phdthesis.pdf
- https://dzone.com/articles/neural-networks-and-automata-theory
- https://www.quora.com/Arent-Neural-Networks-just-State-Machines
- https://www.bennadel.com/blog/2241-parsing-csv-data-with-an-input-stream-and-a-finite-state-machine.htm
- https://papers.nips.cc/paper/757-fools-gold-extracting-finite-state-machines-from-recurrent-network-dynamics.pdf
Relational Data
- https://machinelearningmastery.com/large-data-files-machine-learning/
- https://towardsdatascience.com/why-does-ai-ml-considering-the-examples-of-chatbots-creation-20b1906274f8
Semantic Data
- https://www.kdnuggets.com/2015/05/webdatacommons-data-web-scale-mining.html
- https://lemire.me/blog/2014/12/02/when-bad-ideas-will-not-die-from-classical-ai-to-linked-data/
- https://www.anomalo.com/blog/data-quality-in-machine-learning-best-practices-and-techniques/
- https://spotintelligence.com/2023/04/07/data-quality-machine-learning/
Recognition
- https://www.asug.com/news/costco-bakes-machine-learning-into-a-tasty-customer-experience
- https://gdpr.report/news/2017/08/23/deep-learning-not-ai-future/
- https://www.codeproject.com/Articles/1273113/Apple-tron-an-AI-for-farmers
- https://rationalwiki.org/wiki/Machine_learning
- https://www.oreilly.com/library/view/natural-language-annotation/9781449332693/ch04.html
- https://www.quora.com/How-do-I-design-a-system-to-query-the-database-based-on-natural-language-input
Video Games
- https://www.gamasutra.com/view/news/269634/7_examples_of_game_AI_that_every_developer_should_study.php
- https://bleacherreport.com/articles/2796233-inside-nba2ks-journey-to-the-top-of-sports-gaming
- https://www.gamasutra.com/view/news/296245/Have_a_peek_inside_the_AI_code_of_Street_Fighter_II.php
- https://www.gamasutra.com/view/news/316060/Blizzard_experiments_with_machine_learning_to_fight_Overwatch_toxicity.php
- https://www.gamasutra.com/view/news/336455/Learn_how_machine_learning_can_help_you_make_better_games_at_GDC_2019.php
- https://www.gamasutra.com/blogs/BenWeber/20190426/340293/PortfolioScale_Machine_Learning_atZynga.php
- https://towardsdatascience.com/an-exploration-of-neural-networks-playing-video-games-3910dcee8e4a
- https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii
- https://www.freecodecamp.org/news/how-to-use-ai-to-play-sonic-the-hedgehog-its-neat-9d862a2aef98/
Prediction
- https://www.hackerfactor.com/GenderGuesser.php#Analyze
- https://dalibornasevic.com/posts/61-intro-to-machine-learning-in-ruby
- http://blogs.perl.org/users/sergey_kolychev/2017/02/machine-learning-in-perl.html
- https://www.andrewthompson.co/2012/12/prototyping-with-googles-prediction-api.html
- https://php-ml.readthedocs.io/en/latest/machine-learning/datasets/csv-dataset/
- https://code.msdn.microsoft.com/windowsdesktop/Getting-Started-with-34722da0#content
- http://pdl.perl.org/?docs=FAQ
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3111592/
- https://automatedinsights.com/customer-stories/associated-press/
- https://www.slideshare.net/judederick/whitepaper-1-butterfly-effect-and-big-data
- https://gigazine.net/gsc_news/en/20180420-machine-learning-predict-chaos/
- http://www.stsci.edu/~lbradley/seminar/butterfly.html
- https://www.quantamagazine.org/machine-learnings-amazing-ability-to-predict-chaos-20180418/
- https://en.wikipedia.org/wiki/Trivia
- https://en.wikipedia.org/wiki/Shahnameh
Biased Data
- https://www.inc.com/kit-eaton/why-it-matters-that-googles-ai-gemini-chatbot-made-death-threats-to-a-grad-student/91019626
- https://www.scienceabc.com/innovation/what-is-moravecs-paradox-definition.html
- https://swangroup.net/podcast/diverse-ai-with-rajvir-madan/
Deepfakes
- https://apps.dtic.mil/sti/citations/trecms/AD1180860
- https://apps.dtic.mil/sti/trecms/pdf/AD1180860.pdf
- https://apps.dtic.mil/sti/trecms/pdf/AD1178469.pdf
- https://www.darpa.mil/news-events/2024-03-14
- https://www.youtube.com/watch?v=10ENWUrzj-o
- https://www.darpa.mil/news-events/2023-06-16
Missing Data
| Invention | Inventor | Location | Family | Heirs | Legacy | Reason for Unknowns |
|---|---|---|---|---|---|---|
| Wheel-lock Musket | Unknown | Europe | Not specifically documented | Not specifically documented | The invention paved the way for future advancements in firearm design, influencing the development of more sophisticated ignition mechanisms such as the flintlock. | The specific inventor is not documented due to the collaborative and evolving nature of firearm technology during this period. |
| Horizontal Water Wheel | Unknown | Europe | Not specifically documented | Not specifically documented | Its broader application and inventor of this specific design are not clearly documented. | The broader application and specific inventor are not documented, as many designs were often conceptualized and adapted by various individuals over time. |
Vector Features in RDBMSs
| Database | Feature | Description | URL | Version Introduced |
|---|---|---|---|---|
| PostgreSQL | pgvector | An open-source extension that adds support for vector operations and similarity searches. | pgvector | 12.4 |
| MySQL | MySQL HeatWave | Includes support for vector store and generative AI capabilities, performing similarity searches with LLMs. | HeatWave | 8.0 |
| MariaDB | MariaDB Vector | Allows storing and searching vector data using a modified HNSW algorithm for fast similarity searches. | Vector | 11.7 |
| Sybase | Sybase Features | Currently does not have built-in vector database features similar to PostgreSQL, MySQL, and MariaDB. | Sybase Features | N/A |
| Teradata | Teradata Features | Teradata provides advanced vector capabilities for data analysis and machine learning applications. | Teradata Features | 16.20 |
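As a rough illustration of what a vector similarity search computes, the sketch below ranks a few hypothetical embeddings by cosine similarity to a query. It is a brute-force scan in plain Python, written as an assumption for illustration. The database features in the table rely on indexes (the table mentions HNSW for MariaDB) rather than comparing every row.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def nearest(query, rows, k=2):
    """Return the names of the k stored vectors most similar to the query."""
    ranked = sorted(
        rows.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _vector in ranked[:k]]

# Hypothetical three-dimensional embeddings.
documents = {
    "cat": [1.0, 0.0, 0.1],
    "kitten": [0.9, 0.1, 0.2],
    "car": [0.0, 1.0, 0.0],
}
print(nearest([1.0, 0.0, 0.0], documents))
```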
Bad Fit
What types of data are most poorly labeled among publicly traded companies?
Publicly traded companies often struggle with labeling certain types of data accurately. Some of the most poorly labeled data include:
- Soft Information: This includes intangible assets like the value of research and development, employee training, and morale. These are difficult to quantify and often lead to inconsistencies in reporting.
- Financial Data: Despite efforts to standardize financial reporting with formats like XBRL (Extensible Business Reporting Language), there are still issues with comparability and accuracy.
- Non-Financial Metrics: Data related to environmental, social, and governance (ESG) factors can be inconsistently labeled and reported, leading to difficulties in comparison and analysis.
Bad Data
create_llm_with_bad_quality_data:
  steps_and_pitfalls:
    data_collection:
      poor_data_sources: "Using unreliable or unverified sources can result in collecting irrelevant or incorrect information."
      lack_of_diversity: "If the data lacks diversity in language, style, and context, the model will struggle to generalize and understand different inputs."
    data_preprocessing:
      minimal_cleaning: "Not cleaning or preprocessing data properly leads to noisy inputs, including spelling mistakes, grammatical errors, and inconsistent formatting."
      biased_data: "Training with biased data can reinforce harmful stereotypes and provide skewed responses."
    model_training:
      overfitting: "Training with poor quality data can cause overfitting, where the model performs well on the training data but poorly on new, unseen data."
      low_accuracy: "The model will have low accuracy and poor generalization capabilities due to the flawed input data."
    evaluation_and_tuning:
      inaccurate_evaluation: "Evaluating the model with low-quality validation data results in misleading performance metrics."
      poor_tuning: "Inadequate hyperparameter tuning can further degrade the model's performance."
  consequences:
    unreliable_outputs: "The model will generate inaccurate and unreliable responses, undermining its usefulness."
    reinforcement_of_biases: "Using biased data can perpetuate and amplify existing biases."
    increased_risks: "Deploying such a model can lead to misinformation and ethical concerns."
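The overfitting pitfall listed above can be reproduced in miniature. The sketch below is an illustrative assumption, not an LLM: a one-nearest-neighbour classifier memorizes training labels of which roughly 30% are flipped at random, then is scored on clean data. The function names, sample sizes, and noise rate are all hypothetical.

```python
import random

def make_points(n, rng, noise_rate):
    """Points on [0, 1) labelled 1 if x > 0.5, with some labels flipped."""
    points = []
    for _ in range(n):
        x = rng.random()
        label = 1 if x > 0.5 else 0
        if rng.random() < noise_rate:
            label = 1 - label  # simulate a bad-quality label
        points.append((x, label))
    return points

def predict_1nn(train, x):
    """Copy the label of the closest training point."""
    nearest_point = min(train, key=lambda point: abs(point[0] - x))
    return nearest_point[1]

def accuracy(train, data):
    """Fraction of points in `data` that the 1-NN model labels correctly."""
    hits = sum(predict_1nn(train, x) == label for x, label in data)
    return hits / len(data)

rng = random.Random(0)
train = make_points(200, rng, noise_rate=0.3)  # noisy training labels
test = make_points(200, rng, noise_rate=0.0)   # clean held-out labels

train_acc = accuracy(train, train)
test_acc = accuracy(train, test)
print(f"training accuracy: {train_acc:.2f}")
print(f"held-out accuracy: {test_acc:.2f}")
```

The model scores perfectly on the data it memorized, noise included, and noticeably worse on clean data it has not seen.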
Data Quality Limitations
Physical world games of chance have triggered absurd responses from machine learning algorithms, such as this:
import random
MOVES = ['rock', 'paper', 'scissors']
# Initialize move history
move_history = []
# Function to predict the next move
def predict_next_move(history):
    # With fewer than two moves there is no pattern to match, so guess at random
    if len(history) < 2:
        return random.choice(MOVES)
    # Analyze the last two moves
    last_move = history[-1]
    second_last_move = history[-2]
    # Predict based on pattern
    if second_last_move == 'rock' and last_move == 'paper':
        return 'scissors'
    elif second_last_move == 'paper' and last_move == 'scissors':
        return 'rock'
    elif second_last_move == 'scissors' and last_move == 'rock':
        return 'paper'
    # No recognized pattern: fall back to a random guess
    return random.choice(MOVES)
# Simulate a game against an opponent who plays uniformly at random,
# so any "pattern" the predictor finds is a coincidence
for _ in range(10):
    opponent_move = random.choice(MOVES)
    move_history.append(opponent_move)
    predicted_move = predict_next_move(move_history)
    print(f"Opponent Move: {opponent_move}, Predicted Next Move: {predicted_move}")
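One way to check whether a predictor like this has any edge is to score each guess against the move that actually comes next. The sketch below does that for a hypothetical `repeat_last_move` predictor against a uniformly random opponent. The predictor, the number of rounds, and the seed are assumptions chosen for illustration.

```python
import random

MOVES = ["rock", "paper", "scissors"]

def repeat_last_move(history):
    """Hypothetical predictor: assume the opponent repeats their last move."""
    return history[-1] if history else MOVES[0]

def hit_rate(predictor, rounds=10_000, seed=0):
    """Fraction of rounds in which the predictor names the opponent's next move."""
    rng = random.Random(seed)
    history = []
    hits = 0
    for _ in range(rounds):
        guess = predictor(history)   # guess before the move is revealed
        actual = rng.choice(MOVES)   # opponent plays uniformly at random
        hits += guess == actual
        history.append(actual)
    return hits / rounds

print(f"hit rate: {hit_rate(repeat_last_move):.3f}")
```

Against a random opponent the hit rate stays close to one in three, whatever pattern the predictor claims to have found.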
Purposeful Exacerbation of Logical Fallacies in Algorithms
When algorithms are purposefully designed to exacerbate logical fallacies like the Gambler's Fallacy and the Hot-Hand Fallacy, it is often to exploit certain behaviors or patterns for specific outcomes, typically in trading or investment contexts. Here are some ways it might happen:
Market Manipulation
Algorithms can be designed to create artificial demand or supply in the market by repeatedly trading based on recent performance, making it appear as though certain stocks are trending. This can mislead other traders into believing there is a sustained trend, which isn't actually based on underlying fundamentals.
Reinforcing Biases
By emphasizing recent trends and neglecting the inherent randomness, these algorithms can exploit the Hot-Hand Fallacy. This can drive up the prices of certain assets artificially, creating a bubble that savvy traders might plan to exploit by shorting once the market corrects itself.
Echo Chamber Effect
Algorithms that amplify the Gambler's Fallacy might focus on past losses or downturns, driving prices down further than warranted by fundamentals. This can lead to undervaluation of assets, which can be exploited later when market corrections occur.
Encouraging Risky Behavior
By giving undue weight to recent successes, such algorithms can encourage investors to make increasingly risky bets, believing their 'streak' will continue. This can lead to greater market volatility, which can be advantageous to certain trading strategies.
High-Frequency Trading (HFT)
In HFT, algorithms may exploit micro-trends by making thousands of trades per second, based on minute price movements and past performance trends. This can distort market prices and create opportunities for profit by capitalizing on these artificial fluctuations.
It's worth noting that such practices are often scrutinized and regulated to prevent market abuse and protect investors.
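The Gambler's Fallacy and the Hot-Hand Fallacy both assume that a streak changes the odds of the next independent event. A quick simulation, sketched below with a simulated fair coin and a hypothetical streak length of three, shows that the frequency of heads right after a streak of heads stays near one half.

```python
import random

def heads_rate_after_streak(flips, streak_len=3):
    """Frequency of heads on flips that immediately follow `streak_len` heads."""
    followups = [
        flips[i]
        for i in range(streak_len, len(flips))
        if all(flips[i - streak_len:i])
    ]
    return sum(followups) / len(followups)

rng = random.Random(42)
flips = [rng.random() < 0.5 for _ in range(100_000)]  # True means heads

overall = sum(flips) / len(flips)
after_streak = heads_rate_after_streak(flips)
print(f"heads overall:       {overall:.3f}")
print(f"heads after HHH run: {after_streak:.3f}")
```

Real price series are not coin flips, so this illustrates only the fallacy itself, not how any market behaves.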
Data Cleaning
- https://www.kaggle.com/code/loganlauton/basic-data-clean-helper-nba-players-team-data
Case Studies on Garbage Data
studies:
  - title: "Quantifying Outlierness of Funds from their Categories using Supervised Similarity"
    description: "This study explores the impact of miscategorization in mutual funds using a machine learning approach. The researchers found a strong relationship between miscategorization and future returns, highlighting the significant implications for allocation decisions and investment fund managers."
    url: "https://arxiv.org/abs/2003.02924"
  - title: "Bias and Unfairness in Machine Learning Models: A Systematic Review"
    description: "This systematic review examines the current knowledge on bias and unfairness in machine learning models. It discusses various datasets, tools, fairness metrics, and methods for identifying and mitigating bias. The review emphasizes the importance of addressing miscategorization to ensure fair and unbiased models."
    url: "https://www.mdpi.com/2076-3417/10/18/6462"
  - title: "Evolution and Impact of Bias in Human and Machine Learning Algorithm Interaction"
    description: "This research investigates the iterative interaction between humans and machine learning algorithms. The study highlights how biased data and miscategorization can lead to algorithmic bias, which can further exacerbate the problem through iterative processes."
    url: "https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0226801"
Human Friendly, Machine Unfriendly
Legacy MIME types that are poorly suited to machine learning are typically those that say little about what the content is or how it is structured, so an algorithm cannot process the data without extra work. Examples include:
- **application/octet-stream**: This is a generic binary data format that doesn't provide any information about the content, making it unusable for machine learning without additional processing.
- **application/x-unknown**: This MIME type indicates that the content type is unknown, which means there's no specific format or structure to work with.
- **text/plain**: While plain text can be used for some machine learning tasks, it lacks the structure and richness of more complex data formats like JSON or XML, making it less useful for many applications.
In general, for machine learning, you want data in structured formats like JSON, XML, CSV, or specific file formats like images (JPEG, PNG) or audio (MP3, WAV) that can be easily parsed and processed by algorithms.
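The triage described above can be sketched in a few lines of Python. This is a minimal illustration, not a standard classification: the category names, the `STRUCTURED_TYPES` set, and the functions `triage_mime` and `triage_file` are assumptions made for this example. Only `mimetypes.guess_type` comes from the Python standard library, and its result for a given extension can vary by platform.

```python
import mimetypes
from typing import Optional

# Hypothetical grouping for illustration; adjust to your own pipeline.
STRUCTURED_TYPES = {"application/json", "application/xml", "text/xml", "text/csv"}
MEDIA_PREFIXES = ("image/", "audio/")


def triage_mime(mime: Optional[str]) -> str:
    """Sort a MIME type string into a rough readiness category."""
    if mime is None or mime in ("application/octet-stream", "application/x-unknown"):
        # No usable type information: the bytes must be inspected first.
        return "needs-inspection"
    if mime in STRUCTURED_TYPES:
        return "structured"
    if mime.startswith(MEDIA_PREFIXES):
        return "media"
    if mime == "text/plain":
        # Usable, but there is no declared schema to parse against.
        return "unstructured-text"
    return "other"


def triage_file(filename: str) -> str:
    """Guess the MIME type from the file extension, then triage it."""
    mime, _encoding = mimetypes.guess_type(filename)
    return triage_mime(mime)
```

For example, `triage_mime("text/plain")` returns `"unstructured-text"`, while a file with no recognizable extension falls into `"needs-inspection"`.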
Unusable Data
MetaDatasets:
  - IncompleteData:
      Description: "Datasets heavily fragmented with significant amounts of missing data."
  - NonQuantifiableData:
      Description: "Data that cannot be quantified or is purely qualitative without any structured format."
      Example: "Vague descriptions without standardized terms."
  - InconsistentData:
      Description: "Data that lacks consistency in formats, units, or types."
      Example: "Mixing numerical values with text in the same column."
  - PoorQualityScansForOCR:
      Description: "Handwritten notes or documents that are poorly scanned, blurry, or too faint."
  - RandomOrNoisyData:
      Description: "Data that appears random or contains a high level of noise with no discernible patterns."
  - OutdatedOrIrrelevantData:
      Description: "Data that is too old or not relevant to the current context or domain."
  - UnstructuredTextData:
      Description: "Free-form text data without categorization, tagging, or structuring."
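Two of the categories above, incomplete data and inconsistent data, are easy to detect mechanically. The sketch below is one possible check using only the Python standard library. The function name `profile_columns`, the sample rows, and the choice to treat `None` and the empty string as missing are all assumptions made for illustration.

```python
def profile_columns(rows):
    """Report, per column, the share of missing values and the value types seen.

    `rows` is a list of dicts. A value counts as missing when the key is
    absent or the value is None or an empty string (an assumption of this sketch).
    """
    if not rows:
        return {}
    columns = {key for row in rows for key in row}
    report = {}
    for col in sorted(columns):
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None and v != ""]
        report[col] = {
            "missing_ratio": 1 - len(present) / len(rows),
            # More than one type name here signals a mixed-type column.
            "types": sorted({type(v).__name__ for v in present}),
        }
    return report


# Made-up sample: height mixes numbers with text, team is half missing.
sample_rows = [
    {"height_cm": 180, "team": "A"},
    {"height_cm": "5 ft 11", "team": None},
    {"height_cm": 175},
    {"height_cm": None, "team": "B"},
]
sample_report = profile_columns(sample_rows)
```

A column whose `types` list has more than one entry matches the "mixing numerical values with text" example, and a high `missing_ratio` flags fragmented data.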
Java
machine_learning_libraries:
  - name: "Weka"
    description: "Popular machine learning library written in Java. Provides tools for data pre-processing, classification, regression, clustering, and visualization. Suitable for smaller-scale ML tasks on a Raspberry Pi."
    github_url: "https://github.com/weka/weka"
  - name: "Deeplearning4j"
    description: "Deep learning library for Java and Scala. Designed to be scalable and run on various hardware, including Raspberry Pi. Optimizations and lightweight neural networks are recommended for efficient performance."
    github_url: "https://github.com/deeplearning4j/deeplearning4j"
  - name: "Java-ML"
    description: "Machine learning library for Java offering algorithms for classification, clustering, regression, and more. Relatively lightweight and suitable for running on a Raspberry Pi."
    github_url: "https://github.com/AbeelLab/javaml"
  - name: "MOA (Massive Online Analysis)"
    description: "Framework for data stream mining and machine learning, written in Java. Used for real-time analytics on the Raspberry Pi. Designed to handle large-scale data streams efficiently."
    github_url: "https://github.com/Waikato/moa-2014"
Sports and Stocks
1. Associative Property
- Sports: When partial results are combined, the grouping does not change the total. For example, a relay team's total time is the same whether the first two legs are added and then the third, or the first leg is added to the sum of the last two: (a + b) + c = a + (b + c).
- Stocks: Compounding growth factors is associative. For example, the overall growth of a portfolio over three periods is the same however the per-period factors are grouped: ((1 + r1)(1 + r2))(1 + r3) = (1 + r1)((1 + r2)(1 + r3)).
2. Commutative Property
- Sports: In scoring systems, the order of scoring can be changed without affecting the total score. For example, a basketball team scoring 2 points and then 3 points results in the same total score as scoring 3 points and then 2 points.
- Stocks: The commutative property applies to the addition of returns in a diversified portfolio. For example, the sum of returns from two stocks is the same regardless of the order in which they are added.
3. Distributive Property
- Sports: Scaling a combined quantity equals scaling each part and adding the results: a(b + c) = ab + ac. For example, if a player does 3 sessions a week, each with 20 minutes of sprints and 40 minutes of drills, the weekly total is 3 × (20 + 40) = 3 × 20 + 3 × 40 = 180 minutes.
- Stocks: Applying one rate to a whole portfolio equals applying it to each holding and adding the results. For example, if stocks and bonds both return 5%, the gain is 0.05 × (stocks + bonds) = 0.05 × stocks + 0.05 × bonds. Portfolio risk, by contrast, is generally not additive across assets, so it does not distribute this way.
4. Linear Regression
- Sports: Linear regression is used to predict player performance based on various factors such as training, past performance, and physical attributes.
- Stocks: Linear regression is used to analyze the relationship between stock prices and various economic indicators, and it often serves as a simple baseline for price forecasts.
5. Game Theory
- Sports: Game theory is used to analyze strategic interactions between teams or players, helping to determine optimal strategies in competitive scenarios.
- Stocks: Game theory is used to analyze the interactions between market participants, helping to determine optimal trading strategies in competitive markets.
6. Optimization
- Sports: Optimization techniques are used to improve team performance, player training schedules, and game strategies.
- Stocks: Optimization techniques are used in portfolio management to maximize returns and minimize risk.
7. Probability and Statistics
- Sports: Probability and statistics are used to analyze player and team performance, predict game outcomes, and make strategic decisions.
- Stocks: Probability and statistics are used to analyze market trends, predict stock prices, and assess investment risks.
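The three algebraic properties and the linear regression item in the list above can be checked numerically. The following is a small sketch under stated assumptions: all numbers are made up, the helper names `compound` and `fit_line` are hypothetical, and exact fractions are used so that the equality checks are not disturbed by floating-point rounding.

```python
from fractions import Fraction
from math import prod


def compound(returns):
    """Total return from compounding per-period returns: prod(1 + r) - 1."""
    return prod(1 + r for r in returns) - 1


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up per-period returns of +5%, -2%, and +3%.
r1, r2, r3 = Fraction(5, 100), Fraction(-2, 100), Fraction(3, 100)

# Associative: the grouping of the compounding steps does not matter.
grouped_left = ((1 + r1) * (1 + r2)) * (1 + r3)
grouped_right = (1 + r1) * ((1 + r2) * (1 + r3))
associative_ok = grouped_left == grouped_right == compound([r1, r2, r3]) + 1

# Commutative: 2 points then 3 points equals 3 points then 2 points.
commutative_ok = 2 + 3 == 3 + 2

# Distributive: one rate applied to the whole equals the rate applied per holding.
rate, stocks, bonds = Fraction(5, 100), 6000, 4000
distributive_ok = rate * (stocks + bonds) == rate * stocks + rate * bonds

# Linear regression on made-up points that lie exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

Real return series and player statistics are noisy, so a fitted line will not pass through every point the way it does in this constructed example.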
Mathematics and Large Language Models (LLMs)
When it comes to mathematics and large language models (LLMs), there are some interesting considerations:
- Mathematical Operations: Some mathematical operations are idempotent, meaning that applying them more than once gives the same result as applying them once. Taking an absolute value is an example. Adding zero or multiplying by one is a simpler case: these are identity operations, so repeating them never changes the result.
- Consistency in Responses: LLMs are not inherently idempotent because they generate output probabilistically, but they usually give consistent answers to straightforward mathematical queries. For example, if you ask for the sum of 2 + 2, the model should consistently respond with 4.
- Complex Calculations: For more complex mathematical problems, LLMs may use built-in functions or external tools to perform calculations. The underlying computation is then consistent, but the phrasing and presentation of the response can still vary.
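Idempotence of a mathematical operation can be tested directly. The sketch below checks whether applying a function twice gives the same result as applying it once. The helper name `is_idempotent` and the sample values are assumptions for illustration, and passing on a finite set of samples is evidence, not a proof.

```python
def is_idempotent(func, samples):
    """Return True if func(func(x)) == func(x) for every sample x."""
    return all(func(func(x)) == func(x) for x in samples)


samples = [-3, -1.5, 0, 2, 7.25]

abs_result = is_idempotent(abs, samples)                    # |(|x|)| == |x|
add_zero_result = is_idempotent(lambda x: x + 0, samples)   # identity operation
times_one_result = is_idempotent(lambda x: x * 1, samples)  # identity operation
add_one_result = is_idempotent(lambda x: x + 1, samples)    # each call changes x
square_result = is_idempotent(lambda x: x * x, samples)     # (-3)^2 = 9, 9^2 = 81
```

The first three checks come out true and the last two false. An LLM's sampled text output has no such guarantee, which is why the consistency point above is phrased as "should" and not "will".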
ASCII
Artist / Group | Active Era | Notable For | URL / Reference |
---|---|---|---|
Early Anonymous ASCII Pioneers | 1960s–1970s | Pioneering the creation of computer-based text art on early mainframes and teletype systems. | SCI Python – ASCII Art |
Scott Fahlman | 1982 | Coining the first emoticon (:-)) as a simple form of text-based expression—a precursor to broader ASCII art culture. | Wikipedia: Emoticon |
Joan Stark (jgs) | 1990s | Compiling and popularizing a vast collection of intricate ASCII art; her work remains one of the most recognized online. | ASCII Art Archive at asciiart.eu |
Modern ASCII Art Community | 2000s–Present | A diverse group of online contributors using updated tools and techniques to keep the ASCII art tradition alive. | ASCII Everything |
XML Dialects
xml_dialects:
  - name: AIML (Artificial Intelligence Markup Language)
    url: http://www.aiml.foundation/doc.html
    creation_date: 2001
    last_update_date: 2018
  - name: Collada (Collaborative Design Activity)
    url: https://www.khronos.org/collada/
    creation_date: 2004
    last_update_date: 2016
  - name: CityGML
    url: https://www.ogc.org/publications/standard/citygml/
    creation_date: 2008
    last_update_date: 2012
  - name: BeerXML
    url: http://beerxml.com/beerxml.htm
    creation_date: 2003
    last_update_date: Unknown
  - name: CellML
    url: https://www.cellml.org/specifications/cellml_2.0
    creation_date: 1999
    last_update_date: 2019
  - name: XBRL (eXtensible Business Reporting Language)
    url: https://specifications.xbrl.org/
    creation_date: 1998
    last_update_date: 2023
  - name: IMS Content Packaging
    url: https://www.imsglobal.org/specifications.html
    creation_date: 2000
    last_update_date: 2022
  - name: Akoma Ntoso
    url: https://docs.oasis-open.org/legaldocml/akn-core/v1.0/akn-core-v1.0-part2-specs.html
    creation_date: 2004
    last_update_date: 2018
  - name: CMIS (Content Management Interoperability Services)
    url: https://specifications.oasis-open.org/cmis/
    creation_date: 2008
    last_update_date: 2017
  - name: DITA (Darwin Information Typing Architecture)
    url: https://www.oasis-open.org/committees/dita/
    creation_date: 2005
    last_update_date: 2023
  - name: OPML (Outline Processor Markup Language)
    url: https://dev.opml.org/
    creation_date: 2000
    last_update_date: 2022
  - name: MSBuild (Microsoft Build Engine)
    url: https://learn.microsoft.com/en-us/visualstudio/msbuild/msbuild
    creation_date: 2003
    last_update_date: 2023
  - name: OpenSearch
    url: https://opensearch.org/
    creation_date: 2005
    last_update_date: 2021
  - name: EPUB (Electronic Publication)
    url: https://www.w3.org/publishing/epub3/
    creation_date: 1999
    last_update_date: 2023
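Because these dialects are plain XML, a general-purpose XML parser is usually enough to turn them into data a pipeline can use. As a minimal sketch, the example below reads a small OPML outline with Python's standard `xml.etree.ElementTree` module. The sample document and the helper name `flatten_outline` are made up for illustration; real OPML files carry more attributes than the `text` attribute used here.

```python
import xml.etree.ElementTree as ET

# Made-up OPML document: nested <outline> elements under <body>.
OPML_SAMPLE = """
<opml version="2.0">
  <head><title>Reading list</title></head>
  <body>
    <outline text="Machine learning">
      <outline text="Data cleaning"/>
      <outline text="Model evaluation"/>
    </outline>
    <outline text="XML dialects"/>
  </body>
</opml>
"""


def flatten_outline(node, depth=0):
    """Walk nested <outline> elements and return (depth, text) pairs in order."""
    items = []
    for child in node.findall("outline"):
        items.append((depth, child.get("text", "")))
        items.extend(flatten_outline(child, depth + 1))
    return items


root = ET.fromstring(OPML_SAMPLE.strip())
title = root.findtext("head/title")
outline = flatten_outline(root.find("body"))
```

The flattened `(depth, text)` pairs keep the hierarchy while giving a tabular shape that is easier to feed into later processing steps.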
Limitations
comparison:
concepts:
Wet-on-Wet Painting:
description: "Involves continuous blending of colors on a wet surface, allowing for organic transitions."
plateaus:
- plateaus due to formulaic use: "Over-reliance on predictable blending methods results in paintings that lack variation or originality."
- loss of skill refinement: "Artists may neglect controlled brush techniques, reducing their ability to create detailed or layered effects."
- handicap in manual execution: "Without intentional layering discipline, artists struggle with dry brush techniques or structured painting styles."
- over adaptive to external inputs: "Environmental conditions (humidity, drying time) dictate the painting process more than deliberate artist control."
- creativity restrictions: "The expectation of fluid blending limits experimental approaches such as defined edges, hard contrasts, and unconventional textures."
Coding with AI Assistance:
description: "Developers interact dynamically with AI-generated suggestions, refining code iteratively."
plateaus:
- plateaus due to formulaic use: "Repeated reliance on AI-suggested patterns leads to generic coding structures, reducing innovation in problem-solving."
- loss of skill refinement: "Critical coding techniques like algorithm design, memory optimization, and debugging may weaken over time."
- handicap in manual execution: "Developers struggle when coding without AI prompts, finding it harder to construct solutions independently."
- over adaptive to external inputs: "AI biases in suggestions may override better manual approaches, resulting in suboptimal code structures."
- creativity restrictions: "Programmers may avoid unconventional coding patterns or experimental solutions that AI does not readily suggest."
GPS-Based Driving:
description: "Real-time navigation adjusts dynamically based on external inputs like traffic and road conditions."
plateaus:
- plateaus due to formulaic use: "Drivers default to GPS routes instead of exploring alternatives, limiting their awareness of geography and route diversity."
- loss of skill refinement: "Ability to manually plan routes, read road signs, and estimate travel times deteriorates over time."
- handicap in manual execution: "In situations where GPS fails, drivers struggle to navigate using maps, intuition, or spatial reasoning."
- over adaptive to external inputs: "Drivers become overly dependent on live traffic updates, reacting passively rather than proactively choosing efficient paths."
- creativity restrictions: "Rigid adherence to suggested routes prevents improvisation, such as scenic detours or alternate paths that might be more efficient."
comparison_summary: "Each technique—painting, coding with AI assistance, and GPS-based driving—faces plateaus when used repetitively or with excessive dependence on automated suggestions. Over-reliance can lead to skill degradation, reduced manual problem-solving ability, and diminished creativity."
footer:
links:
Wet-on-Wet Painting Overview: "https://www.arts.gov/stories/blog/2021/wet-wet-technique-artistic-expression"
AI-Assisted Coding Overview: "https://www.nist.gov/news-events/news/2023/ai-assistance-coding-nist-insights"
GPS Navigation Algorithms: "https://www.transportation.gov/research-and-technology/gps-navigation-impact-traffic"
Paywall Trade Secrets
API Name | Primary Use Case | Access Requirements | Documentation Availability | Alternative Public API (Limited Scope) |
---|---|---|---|---|
Xignite API | Market data, stock quotes, financial analytics | Paid subscription required | Limited public details; full API behind paywall | Alpha Vantage |
Bloomberg Terminal API | Real-time financial data, analytics, trading insights | Requires Bloomberg Terminal subscription | Restricted to Bloomberg clients | Twelve Data |
FactSet API | Financial research, portfolio management | Enterprise-level subscription | Only available to FactSet clients | Financial Modeling Prep |
Morningstar Direct API | Investment research, fund analysis | Morningstar Direct subscription | Limited public access; full API restricted | Quandl Free Tier |
S&P Capital IQ API | Company fundamentals, market insights | S&P Capital IQ subscription | Restricted to paying customers | OpenFIGI |
Refinitiv Eikon API | Market data, financial analytics, trading tools | Requires Refinitiv Eikon subscription | Limited public details; full API behind paywall | IEX Cloud |
Quandl Premium APIs | Alternative financial data, economic indicators | Paid subscription for premium datasets | Limited details for premium datasets | EOD Historical Data |
Benchmarking
Security
- https://techcommunity.microsoft.com/blog/microsoft-security-blog/understanding-and-mitigating-security-risks-in-mcp-implementations/4404667
- https://strobes.co/blog/mcp-model-context-protocol-and-its-critical-vulnerabilities/
- https://github.com/nonsleepr/mcp-cve-search
- https://www.tenable.com/blog/faq-about-model-context-protocol-mcp-and-integrating-ai-for-agentic-applications