Machine Learning - sgml/signature GitHub Wiki
Data preprocessing involves gathering and cleaning data, which is essential for any machine learning task.
In new research accepted for publication in Chaos, the researchers showed that improved predictions of chaotic systems like the Kuramoto-Sivashinsky equation become possible by hybridizing the data-driven, machine-learning approach and traditional model-based prediction. Ott sees this as a more likely avenue for improving weather prediction and similar efforts, since we don’t always have complete high-resolution data or perfect physical models. “What we should do is use the good knowledge that we have where we have it,” he said, “and if we have ignorance we should use the machine learning to fill in the gaps where the ignorance resides.”
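The hybrid idea quoted above can be illustrated with a toy sketch. This is not the method from the Chaos paper: it swaps in a much simpler residual-correction scheme, and the logistic map, the parameter values 3.9 and 3.7, and every function name below are assumptions chosen only for illustration. An imperfect model makes a prediction, and a one-parameter least-squares fit learns the part the model gets wrong.

```python
def true_step(x):
    """Stand-in for the real system: logistic map in a chaotic regime (assumed)."""
    return 3.9 * x * (1.0 - x)


def model_step(x):
    """Imperfect knowledge-based model: right form, slightly wrong parameter."""
    return 3.7 * x * (1.0 - x)


def fit_residual_correction(states):
    """Least-squares fit of residual = c * x * (1 - x) from observed transitions."""
    num = 0.0
    den = 0.0
    for x, x_next in zip(states, states[1:]):
        feature = x * (1.0 - x)
        residual = x_next - model_step(x)  # what the model failed to explain
        num += feature * residual
        den += feature * feature
    return num / den


def hybrid_step(x, c):
    """Model prediction plus the learned correction."""
    return model_step(x) + c * x * (1.0 - x)


def mean_one_step_error(step, points):
    return sum(abs(step(x) - true_step(x)) for x in points) / len(points)


# Observed trajectory of the "real" system, used as training data.
states = [0.2]
for _ in range(200):
    states.append(true_step(states[-1]))

c = fit_residual_correction(states)

test_points = [0.1, 0.3, 0.5, 0.7, 0.9]
model_error = mean_one_step_error(model_step, test_points)
hybrid_error = mean_one_step_error(lambda x: hybrid_step(x, c), test_points)
print(f"learned correction: {c:.4f}")
print(f"model-only error: {model_error:.4f}, hybrid error: {hybrid_error:.2e}")
```

Here the learned part only has to fill the gap between the model and the data, which is why a very small fit is enough in this toy case.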
AI is skilled at games.
Machine learning is skilled at statistics.
"Deep learning is also highly susceptible to bias. When Google's facial recognition system was initially rolled out, for instance, it tagged many black faces as gorillas."
Child Psychology
- https://www.hackster.io/Fryden-Learning/ai-and-machine-learning-for-kids-2baa1f
- https://machinelearningforkids.co.uk/#!/worksheets
- https://www.commonsense.org/education/top-picks/best-coding-tools-for-middle-school
- https://www.stevemurch.com/machine-learning-ai-for-kids-resources/2018/12
- https://blog.ozobot.com/parenting/parents-can-explain-artificial-intelligence-machine-learning-kids/
- https://www.samplereality.com/2015/09/05/your-mistake-was-a-vital-strategy/
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6287292/
- https://www.pnas.org/content/105/13/5012
- https://scienceblogs.com/cognitivedaily/2009/04/29/do-we-reason-with-statistics-i
- https://datascience.aero/how-much-data-you-need/
- http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.878.1367&rep=rep1&type=pdf
- http://www.cogsci.ucsd.edu/~deak/cdlab/Publication/LietalTR2009.pdf
Storytelling
I think stories are what make us different from chimpanzees and Neanderthals. And if story-understanding is really where it’s at, we can’t understand our intelligence until we understand that aspect of it.
- http://nautil.us/issue/75/story/the-storytelling-computer
- http://people.csail.mit.edu/phw/index.html
- https://www.memoriesofpatrickwinston.com/
- https://phys.org/news/2017-12-arrow-relative-concept-absolute.html
Obscurity
writers_editors:
  - name: "Walter Bagehot (England)"
    url: "https://www.economicshelp.org/blog/26107/economics/walter-bagehot/"
  - name: "Benjamin Constant (France)"
    url: "https://plato.stanford.edu/entries/constant/"
  - name: "Giuseppe Mazzini (Italy)"
    url: "https://spartacus-educational.com/ITmazzini.htm"
  - name: "Juan Pablo II (Poland)"
    url: "https://www.jstor.org/stable/10.5325/jjohnpajstud.4.2.0001"
  - name: "José Martí (Cuba)"
    url: "https://www.jstor.org/stable/30209112"
Scaffolding
- https://www.forbes.com/sites/janakirammsv/2018/01/01/why-do-developers-find-it-hard-to-learn-machine-learning/#4237864f6bf6
- https://marutitech.com/problems-solved-machine-learning/
- https://www.linkedin.com/pulse/cat-teaching-computers-how-learn-mike-volpi/
- https://machinelearningmastery.com/youre-wrong-machine-learning-not-hard/
- https://www.robinwieruch.de/machine-learning-javascript-web-developers/
- https://law.vanderbilt.edu/files/archive/Judicial_Intuition.pdf
- http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
- https://www.gamasutra.com/view/feature/130578/visual_finite_state_machine_ai_.php
- https://www.codeproject.com/Articles/1275031/Why-Real-Neurons-Learn-Faster
- https://www.quora.com/Is-there-a-way-to-use-Machine-Learning-to-predict-the-outcome-of-a-coin-toss
- https://medium.com/@davidllorente/nlg-technologies-artificial-intelligence-vs-rule-base-approach-cf8e9992461e
- https://medium.com/@davidllorente/automatic-natural-language-generation-the-new-normal-cd36ed8976de
- https://blog.wolfram.com/2010/11/30/how-to-win-at-coin-flipping/
- https://medium.com/@mikeharrisNY/the-heist-is-the-coin-toss-77ee4d870037
- https://docs.aws.amazon.com/silk/latest/developerguide/machine-learning.html
Concepts
- https://developers.google.com/machine-learning/glossary/generative
- http://people.csail.mit.edu/phw/mit.html
- https://web.archive.org/web/20240923014939/https://media.licdn.com/dms/image/v2/D4E22AQGaPFklcD4qHg/feedshare-shrink_800/feedshare-shrink_800/0/1725535189735?e=1729728000&v=beta&t=oYF0O87cI6zVZS9F-rAF5A9V-nI13ScqS5IPsz2oA1o
- https://ml-cheatsheet.readthedocs.io/en/latest/nn_concepts.html
- https://www.sqlservercentral.com/articles/machine-learning-101-the-mathematics-of-an-artificial-neural-network-6
- http://inverseprobability.com/talks/notes/probabilistic-machine-learning.html
- https://www.frontiersin.org/articles/10.3389/fpsyg.2015.01120/full
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2722922/
- https://www.hindawi.com/journals/complexity/2019/2952304/
- http://www.alice.id.tue.nl/references/kahnemann-2003.pdf
- https://stackoverflow.com/questions/616292/is-it-possible-for-a-computer-to-learn-a-regular-expression-by-user-provided-e
- https://ruccs.rutgers.edu/images/personal-rochel-gelman/publications/Obrecht_Chapman_Gelman_2007_Intuitive_t_Tests_Lay_use_of_statistical_information.pdf
- https://ir.lib.uwo.ca/cgi/viewcontent.cgi?referer=&httpsredir=1&article=1230&context=etd
- https://en.wikipedia.org/wiki/Sortition
- https://www.nature.com/articles/ncomms15694
- https://en.wikipedia.org/wiki/Logic_learning_machine
- https://en.wikipedia.org/wiki/Hierarchical_classifier
- https://en.wikipedia.org/wiki/Quasi-likelihood
- https://www.microsoft.com/en-us/research/wp-content/uploads/2017/02/unknown_unknowns_identify_algo.pdf
- https://www.quora.com/Are-artificial-intelligence-compiler-theory-and-Automata-related-together-If-so-then-how
Research
- https://elitedatascience.com/machine-learning-algorithms
- https://www.riskiq.com/blog/external-threat-management/machine-learning-silver-bullet/
- https://medium.com/data-from-the-trenches/marketing-attribution-tutorial-part-2-7bb78dec502
- https://www.ibm.com/developerworks/community/blogs/jfp/entry/Feature_Engineering_For_Deep_Learning?lang=en
- https://www.ibm.com/developerworks/community/blogs/jfp/entry/Machine_Learning_As_Prescriptive_Analytics?lang=en
- https://arxiv.org/pdf/1907.06094.pdf
- https://www.nasdaq.com/articles/5-ways-companies-are-transforming-their-businesses-machine-learning-2019-03-13
- https://www.codeproject.com/Articles/4414/A-Proposed-Model-for-Simulating-Human-Artificial-I
- https://sports.stackexchange.com/questions/749/do-basketball-players-tend-to-improve-at-shooting-free-throws-over-the-course-of
- https://sports.stackexchange.com/questions/12074/how-to-collect-stats-in-nba
- https://sports.stackexchange.com/questions/20922/is-there-a-standard-advanced-stat-in-basketball-for-measuring-player-consist
- https://prosportsanalytics.com/2017/05/25/predicting-nba-salaries-part-1/
- http://horror.dreamdawn.com/?p=49113
- https://www.gamasutra.com/blogs/PaulTozour/20141216/232023/The_Game_Outcomes_Project_Part_1_The_Best_and_the_Rest.php
- https://ai.stackexchange.com/questions/5577/is-it-possible-to-write-an-adaptive-parser
- https://conf.slac.stanford.edu/xldb2019/sites/xldb2019.conf.slac.stanford.edu/files/Wed_10.55_Seyed_Umar_BigQueryML-XLDB2019.pdf
- https://towardsdatascience.com/deep-neural-network-implemented-in-pure-sql-over-bigquery-f3ed245814d3
SQL Algorithms
- https://thenewstack.io/sql-fans-can-now-develop-ml-applications/
- https://www.snowflake.com/blog/synthetic-data-generation-at-scale-part-2/
- https://towardsdatascience.com/machine-learning-with-sql-ae46b1fe78a9?gi=bd255df749bd
- https://towardsdatascience.com/learning-sql-201-optimizing-queries-regardless-of-platform-918a3af9c8b1
Data Quality
- http://www.oecd.org/statistics/statisticalresources.htm
- https://apiumhub.com/tech-blog-barcelona/introduction-perceptual-hashes-measuring-similarity/
- https://www.openprisetech.com/blog/i-predict-your-predictive-scoring-project-will-fail-heres-why-and-how-to-do-it-right/
- http://trap.ncirl.ie/3437/1/marinalambert.pdf
- https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/data-mining-using-machine-learning-to-rediscover-customers-paper.pdf
- https://docs.oracle.com/cd/E14004_01/books/PDF/MKTG_User.pdf
- https://steelkiwi.com/blog/what-is-machine-learning/
- https://www.altexsoft.com/blog/datascience/preparing-your-dat
- https://sports.stackexchange.com/questions/20922/is-there-a-standard-advanced-stat-in-basketball-for-measuring-player-consist
- https://sports.stackexchange.com/questions/12074/how-to-collect-stats-in-nba
Techniques
- https://rosettacode.org/wiki/Rock-paper-scissors
- https://stripe.com/blog/fraud-reporting
- https://www.analyticsvidhya.com/glossary-of-common-statistics-and-machine-learning-terms/
- https://www.linkedin.com/pulse/predicting-buying-behavior-through-machine-learning-case-mitra/
- https://www.altexsoft.com/blog/datascience/preparing-your-dataset-for-machine-learning-8-basic-techniques-that-make-your-data-better/
- https://www.linkedin.com/pulse/naive-bayes-classifier-foundation-machine-learning-chase-perkins
- https://www.reddit.com/r/askscience/comments/3zghfk/mathematics_probability_question_do_we_treat_coin/
- https://statweb.stanford.edu/~susan/papers/headswithJ.pdf
- https://www.kaggle.com/yusukesaito0141/bigqueryml-is-all-you-need
- https://codegolf.stackexchange.com/questions/11880/build-a-working-game-of-tetris-in-conways-game-of-life
- https://gamedev.stackexchange.com/questions/55151/rpg-logarithmic-leveling-formula
- https://www.forbes.com/sites/audreymurrell/2019/05/30/big-data-and-the-problem-of-bias-in-higher-education/
- https://2021.ai/fairness-in-machine-learning/
- https://newsroom.haas.berkeley.edu/minority-homebuyers-face-widespread-statistical-lending-discrimination-study-finds/
- http://machineintelligenceafrica.org/about/
- https://www.technologyreview.com/s/613848/ai-africa-machine-learning-ibm-google/
- https://medium.com/pandorabots-blog/aiml-tutorial-the-srai-tag-5bb1f9d08169
OCR
Decision Trees
- https://www.w3schools.com/python/python_ml_decision_tree.asp
- https://www.slideshare.net/ShivangiGupta54/tree-pruning-56173803
- https://machinelearningmastery.com/implement-decision-tree-algorithm-scratch-python/
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4980076/
- https://medium.com/greyatom/decision-trees-a-simple-way-to-visualize-a-decision-dc506a403aeb
- https://www.nltk.org/book/ch06.html
- https://sefiks.com/2018/10/27/how-pruning-works-in-decision-trees/
Relational Data Interoperability
- http://www.kde.cs.tsukuba.ac.jp/~masa/papers/thesis.pdf
- http://cmj4.web.rice.edu/mat_vec.pdf
- https://core.ac.uk/download/pdf/34329967.pdf
- https://towardsdatascience.com/set-theory-basic-notation-da93c3d48090
Full Text Faceted Search Engine Marketing
- https://lucene.apache.org/solr/guide/7_5/machine-learning.html
- https://findwise.com/blog/improve-search-relevance-using-machine-learning-statistics-apache-solr-learning-rank/
- https://www.hillstonenet.com/blog/a-hybrid-approach-to-detect-malicious-web-crawlers/
- http://www.cs.stir.ac.uk/~kms/schools/rps/index.php
- https://www.aclweb.org/anthology/D07-1086
- https://searchengineland.com/google-extends-same-meaning-close-variants-to-phrase-match-broad-match-modifiers-320138
- https://www.datacamp.com/community/tutorials/sem-data-science
- https://moz.com/blog/google-vs-bing-correlation-analysis-of-ranking-elements
- https://itnext.io/apache-solr-because-your-database-is-not-a-search-engine-57705352df8a
- https://content.iospress.com/articles/data-science/ds007
- https://spidermonkey.ca/robot.shtml
- https://docs.oracle.com/cd/E05317_01/psft/acrobat/dm0682.pdf
NLP
- https://luckytoilet.wordpress.com/2018/01/01/real-world-applications-of-automaton-theory/
- https://en.wikipedia.org/wiki/Bag-of-words_model
- https://www.npmjs.com/package/search-index
- https://davidwalsh.name/open-search
- http://2ality.com/2013/06/chrome-omnibox-search.html
- https://docs.servicenow.com/bundle/london-platform-administration/page/integrate/inbound-other-web-services/task/t_BuildSearchProviderForInstance.html
- https://towardsdatascience.com/predicting-logics-lyrics-with-machine-learning-9e42aff63730
- https://www.newscientist.com/article/2115684-machine-learning-lets-computer-create-melodies-to-fit-any-lyrics/
- https://medium.com/@yashka.troy/alphabet-array-solving-anagrams-fc5e1ac68431
- https://www.scirp.org/journal/PaperInformation.aspx?PaperID=20943
- https://scholarsarchive.byu.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=3813&context=etd
- http://www.lwebzem.com/cgi-bin/courses/course_view.cgi?c=naive_bayes_classification_course.cgi
Music
NLG
- title:machine title:learning title:lyrics -ext:pdf
- https://github.com/llSourcell/Rap_Lyric_Generator/blob/master/MarkovRap.py
- https://www.cjr.org/tow_center_reports/guide_to_automated_journalism.php/
- https://www.salesforce.com/blog/2016/09/artificial-intelligence-helps-small-businesses.html
- https://pdfs.semanticscholar.org/08ed/cd794d534450f46ba5969f3e4098a0b4c744.pdf
- https://www.import.io/post/neural-nets-how-regular-expressions-brought-about-deep-learning/
- http://www.agence-nationale-recherche.fr/Project-ANR-14-CE24-0033
Gender Prediction
- https://pypi.org/project/gender-guesser/
- https://github.com/appeler/ethnicolr
- https://www.namsor.com/
- https://stephenholiday.com/articles/2011/gender-prediction-with-python/
- https://www.kdnuggets.com/2015/11/machine-learning-predict-gender.html
- https://github.com/alecglassford/compciv-2016/blob/master/projects/gender-detector-data/README.md
Codegen
- https://www.functionize.com/blog/robot-framework-a-closer-look-at-keyword-driven-testing-approach/
- https://www.codeproject.com/Articles/1156694/A-Look-into-the-Future-Source-Code-Generation-by-t
- https://www.red3d.com/cwr/steer/
Datasets
- https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research
- https://archive.ics.uci.edu/ml/datasets/letter+recognition
- https://medium.com/@saframpton/our-lossy-alphabet-a59516e8c3fb
- https://github.com/angular/code.angularjs.org/blob/master/1.4.9/docs/js/search-data.json
Packages
- https://www.predictiveanalyticstoday.com/top-artificial-neural-network-software/
- https://www.kaggle.com/vimota/getting-started-with-bigquery-ml-in-r
Classifiers
- https://blog.feedly.com/data-science-behind-recommendations-in-feedly/
- https://pdfs.semanticscholar.org/ccba/c9c9cda72b27bfda0d780be86da744b6ce7c.pdf
- https://www.supermarketguru.com/articles/the-first-of-its-kind-study-shows-ai-tool-can-improve-best-practices-in-managing-nut-allergies/
- https://www.businessinsider.com/how-to-say-hello-around-the-world-2015-8
- https://growth.wingify.com/what-you-need-to-know-before-you-board-the-machine-learning-train-a81c513098fe
- https://blog.fastforwardlabs.com/2017/03/09/fairml-auditing-black-box-predictive-models.html
- http://inverseprobability.com/talks/notes/the-three-ds-of-machine-learning.html
- https://www.scienceabc.com/innovation/how-did-the-nintendo-game-duck-hunt-work.html
- https://venturebeat.com/2019/01/18/machine-learning-is-rescuing-old-game-textures-in-zelda-and-final-fantasy/
- https://www.quora.com/Which-machine-learning-algorithms-do-Super-Smash-Bros-Wii-U-Amiibos-utilize
- https://towardsdatascience.com/how-to-win-over-70-matches-in-rock-paper-scissors-3e17e67e0dab
Deepfake Detection
- https://regtechafrica.com/thaless-friendly-hackers-unit-invents-metamodel-to-detect-ai-generated-deepfake-images/
- https://www.ycombinator.com/companies/reality-defender
Fantasy Basketball
- https://basketball.fantasysports.yahoo.com/nba/167497/
- https://developer.yahoo.com/fantasysports/guide/
Fantasy Football
- https://archive.fantasysports.yahoo.com/nfl/2016/560404/
- https://www.nfl.com/playerhealthandsafety/resources/press-releases/top-finishers-in-nfl-data-challenge-improve-league-s-ability-to-predict-injuries
Dictionaries
Chatbots
- https://chatbotslife.com/text-classification-using-algorithms-e4d50dcba45
- https://www.codeproject.com/Articles/12454/Developing-AI-chatbots
- https://www.tutorialspoint.com/aiml/
- https://blog.publicinput.com/news/blog/introducing-kevin-kamto-and-machine-learning-92c8cd0d67cf
- https://chatbotsmagazine.com/what-is-the-working-of-a-chatbot-e99e6996f51c
- https://www.marutitech.com/why-can-chatbots-replace-mobile-apps-immediately/
- https://www.gamasutra.com/view/feature/132155/beyond_aiml_chatbots_102.php
- https://easydita.com/a-chatbot-maturity-model/
- https://www.sam-solutions.com/blog/java-is-it-the-best-language-for-artificial-intelligence/
- https://chatterbot.readthedocs.io/en/stable/tutorial.html
Slackbots
- https://medium.com/slack-developer-blog/conversing-with-ai-on-slack-5af2561f98a5
- https://medium.com/@SAPCAI/a-natural-language-slackbot-19ca5b0fc64b
- https://uxplanet.org/design-lessons-from-building-an-ai-slackbot-for-uncle-brian-73f5a9b1fe89
- https://dzone.com/articles/build-a-scheduler-slackbot-in-30-minutes
Finite State Machine
- https://borjaballe.github.io/other/phdthesis.pdf
- https://dzone.com/articles/neural-networks-and-automata-theory
- https://www.quora.com/Arent-Neural-Networks-just-State-Machines
- https://www.bennadel.com/blog/2241-parsing-csv-data-with-an-input-stream-and-a-finite-state-machine.htm
- https://papers.nips.cc/paper/757-fools-gold-extracting-finite-state-machines-from-recurrent-network-dynamics.pdf
Relational Data
- https://machinelearningmastery.com/large-data-files-machine-learning/
- https://towardsdatascience.com/why-does-ai-ml-considering-the-examples-of-chatbots-creation-20b1906274f8
Semantic Data
- https://www.kdnuggets.com/2015/05/webdatacommons-data-web-scale-mining.html
- https://lemire.me/blog/2014/12/02/when-bad-ideas-will-not-die-from-classical-ai-to-linked-data/
Recognition
- https://www.asug.com/news/costco-bakes-machine-learning-into-a-tasty-customer-experience
- https://gdpr.report/news/2017/08/23/deep-learning-not-ai-future/
- https://www.codeproject.com/Articles/1273113/Apple-tron-an-AI-for-farmers
- https://rationalwiki.org/wiki/Machine_learning
- https://www.oreilly.com/library/view/natural-language-annotation/9781449332693/ch04.html
- https://www.quora.com/How-do-I-design-a-system-to-query-the-database-based-on-natural-language-input
Video Games
- https://www.gamasutra.com/view/news/269634/7_examples_of_game_AI_that_every_developer_should_study.php
- https://bleacherreport.com/articles/2796233-inside-nba2ks-journey-to-the-top-of-sports-gaming
- https://www.gamasutra.com/view/news/296245/Have_a_peek_inside_the_AI_code_of_Street_Fighter_II.php
- https://www.gamasutra.com/view/news/316060/Blizzard_experiments_with_machine_learning_to_fight_Overwatch_toxicity.php
- https://www.gamasutra.com/view/news/336455/Learn_how_machine_learning_can_help_you_make_better_games_at_GDC_2019.php
- https://www.gamasutra.com/blogs/BenWeber/20190426/340293/PortfolioScale_Machine_Learning_atZynga.php
- https://towardsdatascience.com/an-exploration-of-neural-networks-playing-video-games-3910dcee8e4a
- https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii
- https://www.freecodecamp.org/news/how-to-use-ai-to-play-sonic-the-hedgehog-its-neat-9d862a2aef98/
Prediction
- https://www.hackerfactor.com/GenderGuesser.php#Analyze
- https://dalibornasevic.com/posts/61-intro-to-machine-learning-in-ruby
- http://blogs.perl.org/users/sergey_kolychev/2017/02/machine-learning-in-perl.html
- https://www.andrewthompson.co/2012/12/prototyping-with-googles-prediction-api.html
- https://php-ml.readthedocs.io/en/latest/machine-learning/datasets/csv-dataset/
- https://code.msdn.microsoft.com/windowsdesktop/Getting-Started-with-34722da0#content
- http://pdl.perl.org/?docs=FAQ
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3111592/
- https://automatedinsights.com/customer-stories/associated-press/
- https://www.slideshare.net/judederick/whitepaper-1-butterfly-effect-and-big-data
- https://gigazine.net/gsc_news/en/20180420-machine-learning-predict-chaos/
- http://www.stsci.edu/~lbradley/seminar/butterfly.html
- https://www.quantamagazine.org/machine-learnings-amazing-ability-to-predict-chaos-20180418/
- https://en.wikipedia.org/wiki/Trivia
- https://en.wikipedia.org/wiki/Shahnameh
Deepfakes
- https://apps.dtic.mil/sti/citations/trecms/AD1180860
- https://apps.dtic.mil/sti/trecms/pdf/AD1180860.pdf
- https://apps.dtic.mil/sti/trecms/pdf/AD1178469.pdf
- https://www.darpa.mil/news-events/2024-03-14
- https://www.youtube.com/watch?v=10ENWUrzj-o
- https://www.darpa.mil/news-events/2023-06-16
Vector Features in RDBMSs
Database | Feature | Description | URL | Version Introduced |
---|---|---|---|---|
PostgreSQL | pgvector | An open-source extension that adds support for vector operations and similarity searches. | pgvector | 12.4 |
MySQL | MySQL HeatWave | Includes support for vector store and generative AI capabilities, performing similarity searches with LLMs. | HeatWave | 8.0 |
MariaDB | MariaDB Vector | Allows storing and searching vector data using a modified HNSW algorithm for fast similarity searches. | Vector | 11.7 |
Sybase | Sybase Features | Currently does not have built-in vector database features similar to PostgreSQL, MySQL, and MariaDB. | Sybase Features | N/A |
Teradata | Teradata Features | Teradata provides advanced vector capabilities for data analysis and machine learning applications. | Teradata Features | 16.20 |
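As a rough illustration of what the similarity searches in the table compute, here is a minimal sketch of an exact (brute-force) nearest-neighbor lookup by cosine distance. It does not use any of the listed databases; the row names, the three-dimensional vectors, and the helper functions are hypothetical. Index structures such as the HNSW algorithm mentioned in the MariaDB row exist to approximate this ranking without scanning every row.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; smaller means the vectors point in more similar directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, rows, k=2):
    """Return the k rows whose vectors are closest to the query (full scan)."""
    return sorted(rows, key=lambda row: cosine_distance(query, row[1]))[:k]


# Hypothetical table of (label, embedding) rows.
rows = [
    ("cat", [0.9, 0.1, 0.0]),
    ("dog", [0.8, 0.2, 0.1]),
    ("car", [0.0, 0.1, 0.9]),
]

top = nearest([1.0, 0.0, 0.0], rows, k=2)
print([label for label, _ in top])
```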
Bad Fit
What types of data are most poorly labeled among publicly traded companies?
Publicly traded companies often struggle with labeling certain types of data accurately. Some of the most poorly labeled data include:
- Soft Information: This includes intangible assets like the value of research and development, employee training, and morale. These are difficult to quantify and often lead to inconsistencies in reporting.
- Financial Data: Despite efforts to standardize financial reporting with formats like XBRL (Extensible Business Reporting Language), there are still issues with comparability and accuracy.
- Non-Financial Metrics: Data related to environmental, social, and governance (ESG) factors can be inconsistently labeled and reported, leading to difficulties in comparison and analysis.
Bad Data
create_llm_with_bad_quality_data:
  steps_and_pitfalls:
    data_collection:
      poor_data_sources: "Using unreliable or unverified sources can result in collecting irrelevant or incorrect information."
      lack_of_diversity: "If the data lacks diversity in language, style, and context, the model will struggle to generalize and understand different inputs."
    data_preprocessing:
      minimal_cleaning: "Not cleaning or preprocessing data properly leads to noisy inputs, including spelling mistakes, grammatical errors, and inconsistent formatting."
      biased_data: "Training with biased data can reinforce harmful stereotypes and provide skewed responses."
    model_training:
      overfitting: "Training with poor quality data can cause overfitting, where the model performs well on the training data but poorly on new, unseen data."
      low_accuracy: "The model will have low accuracy and poor generalization capabilities due to the flawed input data."
    evaluation_and_tuning:
      inaccurate_evaluation: "Evaluating the model with low-quality validation data results in misleading performance metrics."
      poor_tuning: "Inadequate hyperparameter tuning can further degrade the model's performance."
  consequences:
    unreliable_outputs: "The model will generate inaccurate and unreliable responses, undermining its usefulness."
    reinforcement_of_biases: "Using biased data can perpetuate and amplify existing biases."
    increased_risks: "Deploying such a model can lead to misinformation and ethical concerns."
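The minimal_cleaning pitfall above can be made concrete with a small sketch of the opposite practice. This is only one possible cleaning pass, using the Python standard library; the function name, the sample records, and the choice to deduplicate case-insensitively are assumptions for illustration, not a recommended pipeline.

```python
import unicodedata


def clean_records(records):
    """Normalize formatting, then drop empty and duplicate text records."""
    seen = set()
    cleaned = []
    for text in records:
        # Fold compatibility characters (e.g. non-breaking spaces) to plain forms.
        normalized = unicodedata.normalize("NFKC", text)
        # Collapse runs of whitespace and trim the ends.
        normalized = " ".join(normalized.split())
        key = normalized.casefold()  # case-insensitive duplicate check
        if not normalized or key in seen:
            continue
        seen.add(key)
        cleaned.append(normalized)
    return cleaned


# Hypothetical noisy inputs: stray spaces, a case variant, an empty row,
# and a non-breaking space hiding a duplicate.
raw = ["  Hello   world ", "hello world", "", "Hello\u00a0world", "Good  data"]
cleaned = clean_records(raw)
print(cleaned)
```

A pass like this addresses inconsistent formatting and duplicates only; spelling mistakes, bias, and unreliable sources need separate checks.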
Purposeful Exacerbation of Logical Fallacies in Algorithms
When algorithms are purposefully designed to exacerbate logical fallacies like the Gambler's Fallacy and the Hot-Hand Fallacy, it is often to exploit certain behaviors or patterns for specific outcomes, typically in trading or investment contexts. Here are some ways it might happen:
Market Manipulation
Algorithms can be designed to create artificial demand or supply in the market by repeatedly trading based on recent performance, making it appear as though certain stocks are trending. This can mislead other traders into believing there is a sustained trend, which isn't actually based on underlying fundamentals.
Reinforcing Biases
By emphasizing recent trends and neglecting the inherent randomness, these algorithms can exploit the Hot-Hand Fallacy. This can drive up the prices of certain assets artificially, creating a bubble that savvy traders might plan to exploit by shorting once the market corrects itself.
Echo Chamber Effect
Algorithms that amplify the Gambler's Fallacy might focus on past losses or downturns, driving prices down further than warranted by fundamentals. This can lead to undervaluation of assets, which can be exploited later when market corrections occur.
Encouraging Risky Behavior
By giving undue weight to recent successes, such algorithms can encourage investors to make increasingly risky bets, believing their 'streak' will continue. This can lead to greater market volatility, which can be advantageous to certain trading strategies.
High-Frequency Trading (HFT)
In HFT, algorithms may exploit micro-trends by making thousands of trades per second, based on minute price movements and past performance trends. This can distort market prices and create opportunities for profit by capitalizing on these artificial fluctuations.
It's worth noting that such practices are often scrutinized and regulated to prevent market abuse and protect investors.
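Both fallacies named in this section assume that a streak changes the odds of the next independent event. A small simulation, sketched here with fair coin flips standing in for independent outcomes (the function name, flip count, streak length, and seed are all arbitrary assumptions), estimates how often heads follows a run of heads.

```python
import random


def heads_rate_after_streak(n_flips, streak_len, seed=0):
    """Fraction of heads among flips that immediately follow streak_len heads in a row."""
    rng = random.Random(seed)
    flips = [rng.random() < 0.5 for _ in range(n_flips)]
    followups = [
        flips[i]
        for i in range(streak_len, n_flips)
        if all(flips[i - streak_len:i])  # previous streak_len flips were all heads
    ]
    return sum(followups) / len(followups)


rate = heads_rate_after_streak(200_000, 3)
print(f"P(heads | 3 heads in a row) is approximately {rate:.3f}")
```

For independent flips the estimate should stay close to 0.5: the streak makes the next head neither "due to end" (Gambler's Fallacy) nor more likely to continue (Hot-Hand Fallacy).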
Data Cleaning
- https://www.kaggle.com/code/loganlauton/basic-data-clean-helper-nba-players-team-data
Case Studies on Garbage Data
studies:
  - title: "Quantifying Outlierness of Funds from their Categories using Supervised Similarity"
    description: "This study explores the impact of miscategorization in mutual funds using a machine learning approach. The researchers found a strong relationship between miscategorization and future returns, highlighting the significant implications for allocation decisions and investment fund managers."
    url: "https://arxiv.org/abs/2003.02924"
  - title: "Bias and Unfairness in Machine Learning Models: A Systematic Review"
    description: "This systematic review examines the current knowledge on bias and unfairness in machine learning models. It discusses various datasets, tools, fairness metrics, and methods for identifying and mitigating bias. The review emphasizes the importance of addressing miscategorization to ensure fair and unbiased models."
    url: "https://www.mdpi.com/2076-3417/10/18/6462"
  - title: "Evolution and Impact of Bias in Human and Machine Learning Algorithm Interaction"
    description: "This research investigates the iterative interaction between humans and machine learning algorithms. The study highlights how biased data and miscategorization can lead to algorithmic bias, which can further exacerbate the problem through iterative processes."
    url: "https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0226801"