DR. JEFF DANIELS
Digital Transformation | Leader | Professor

Data Wrangling: 101 Sources for Artificial Intelligence and Machine Learning

8/9/2020

 
  1. The 50 Best Free Datasets for Machine Learning, Lionbridge AI
  2. Google Cloud Public Datasets
  3. Machine Learning and AI Datasets, Carnegie Mellon University
  4. Big Data and AI: 30 Amazing and Free Public Data Sources
  5. Awesome Autonomous Vehicles Datasets, GitHub
  6. Fueling the Gold Rush, The Greatest Public Datasets for AI, StartupGrind
  7. Places to Find Free Datasets for Data Science Projects, Dataquest
  8. The Best Datasets for Natural Language Processing, Gengo AI
  9. Awesome Public Datasets, GitHub
  10. StatLib Datasets Archive, Carnegie Mellon
  11. Institutional Research and Analysis | Common Datasets
  12. Datasets and Project Suggestions | Andrew W. Moore
  13. Datasets | Machine Learning Repository | MIT
  14. Datasets | MIT Lincoln Laboratory  
  15. Stanford Large Network Dataset Collection | Stanford University
  16. Stanford Common Dataset | Stanford University
  17. Datalab | UC Berkeley  
  18. Exploring Datasets | Data Science at Berkeley
  19. DeepDrive | UC Berkeley
  20. Machine Learning Datasets and Project Ideas — Work on real-time Data Science Projects | Data Flair
  21. Government, State, City, Local, public data sites and portals
  22. Data APIs, Hubs, Marketplaces, Platforms, and Search Engines.
  23. Google Dataset Search
  24. Appen Open Source Datasets.
  25. AssetMacro, historical data of Macroeconomic Indicators and Market Data.
  26. Awesome Public Datasets on GitHub, curated by caesar0301.
  27. AWS (Amazon Web Services) Public Data Sets, provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications.
  28. BigML big list of public data sources.
  29. Bioassay data, described in Virtual screening of bioassay data, by Amanda Schierz, J. of Cheminformatics, with 21 Bioassay datasets (Active / Inactive compounds) available for download.
  30. Bitly 1.usa.gov data, anonymized clicks on gov links.
  31. Canada Open Data, pilot project with many government and geospatial datasets.
  32. Causality Workbench data repository.
  33. Corral Big Data repository at Texas Advanced Computing Center, supporting data-centric science.
  34. Credit Risk Analytics Data: a home equity loans credit data set, mortgage loan level data set, Loss Given Default (LGD) data set and corporate ratings data set.
  35. Data Source Handbook, A Guide to Public Data, by Pete Warden, O'Reilly (Jan 2011).
  36. Datacatalogs.org, open government data from US, EU, Canada, CKAN, and more.
  37. Data.gov.uk, publicly available data from UK (also London datastore.)
  38. Data.gov/Education, central guide for education data resources including high-value data sets, data visualization tools, resources for the classroom, applications created from open data and more.
  39. DataMarket, visualize the world's economy, societies, nature, and industries, with 100 million time series from UN, World Bank, Eurostat and other important data providers.
  40. Datamob, public data put to good use.
  41. Data Planet, the largest repository of standardized and structured statistical data, with over 25 billion data points, 4.3 billion datasets, 400+ source databases.
  42. Datasets.co, datasets for data geeks, find and share Machine Learning datasets.
  43. DataSF.org, a clearinghouse of datasets available from the City & County of San Francisco, CA.
  44. DataFerrett, a data mining tool that accesses and manipulates TheDataWeb, a collection of many on-line US Government datasets.
  45. Delve, Data for Evaluating Learning in Valid Experiments.
  46. EconData, thousands of economic time series, produced by a number of US Government agencies.
  47. data.world, discover and share cool data, connect with interesting people, and work together to solve problems faster.
  48. Enron Email Dataset, data from about 150 users, mostly senior management of Enron.
  49. Europeana Data, contains open metadata on 20 million texts, images, videos and sounds gathered by Europeana - the trusted and comprehensive resource for European cultural heritage content.
  50. FEDSTATS, a comprehensive source of US statistics and more
  51. FIMI repository for frequent itemset mining, implementations and datasets.
  52. Financial Data Finder at OSU, a large catalog of financial data sets.
  53. GDELT: The Global Data on Events, Location and Tone, described by Guardian as "a big data history of life, the universe and everything."
  54. GEO (Gene Expression Omnibus), a gene expression/molecular abundance repository supporting MIAME-compliant data submissions, and a curated online resource for gene expression data browsing, query, and retrieval.
  55. GeoDa Center, geographical and spatial data.
  56. Google ngrams datasets, text from millions of books scanned by Google.
  57. Grain Market Research, financial data including stocks, futures, etc.
  58. HitCompanies Datasets, comprehensive data on 10,000 UK companies randomly sampled from HitCompanies, updated automatically using AI/Machine Learning.
  59. ICWSM-2009 dataset contains 44 million blog posts made between August 1st and October 1st, 2008.
  60. Infochimps, an open catalog and marketplace for data. You can share, sell, curate, and download data about anything and everything.
  61. Investor Links, includes financial data
  62. JMP Public featured datasets
  63. Kaggle Datasets.
  64. KDD Cup center, with all data, tasks, and results.
  65. KONECT, the Koblenz Network Collection, with large network datasets of all types in order to perform research in the area of network mining.
  66. Linking Open Data project, aimed at making data freely available to everyone.
  67. LoveTheSales data request page, free access to data for editors and academics to mine stats on the retail industry.
  68. Lyst Fashion Data Trends, tracking 10 million global fashion searches a month, easily and freely accessible to academics as a valuable resource.
  69. Million Song Dataset
  70. MIT Cancer Genomics gene expression datasets and publications, from MIT Whitehead Center for Genome Research.
  71. ML Data, the data repository of the EU Pascal2 networks.
  72. NASDAQ Data Store, provides access to market data.
  73. National Government Statistical Web Sites, data, reports, statistical yearbooks, press releases, and more from about 70 web sites, including countries from Africa, Europe, Asia, and Latin America.
  74. National Space Science Data Center (NSSDC), NASA data sets from planetary exploration, space and solar physics, life sciences, astrophysics, and more.
  75. NetworkRepository: Interactive Data Repository, has many collections of graphs and networks from social science, machine learning, scientific computing, and other areas.
  76. Open Data Census, assesses the state of open data around the world.
  77. OpenData from Socrata, access to over 10,000 datasets including business, education, government, and fun.
  78. Open Source Sports, many sports databases, including Baseball, Football, Basketball, and Hockey.
  79. Peter Skomoroch dataset Bookmarks
  80. PubGene(TM) Gene Database and Tools, genomic-related publications database
  81. Quandl, a collaboratively curated portal to millions of financial and economic time-series datasets.
  82. Robert Shiller data on housing, stock market, and more from his book Irrational Exuberance.
  83. SMD: Stanford Microarray Database, stores raw and normalized data from microarray experiments.
  84. Jerry Smith dataset collection, with Finance, Government, Machine Learning, Science, and other data.
  85. SourceForge.net Research Data, includes historic and status statistics on approximately 100,000 projects and over 1 million registered users' activities at the project management web site.
  86. Sports Statistics, with data for Soccer, NBA, NFL, NHL, and more.
  87. StatLib, CMU Datasets Archive.
  88. Time Series Data Library
  89. Vhinny, provides fundamental financial information on the website and in .csv datasets for download.
  90. Visual Analytics Benchmark Repository.
  91. UCI KDD Database Repository for large datasets used in machine learning and knowledge discovery research.
  92. UCI Machine Learning Repository.
  93. UCR Time Series Data Archive, offering datasets, papers, links, and code.
  94. UK Open Postcode Geo, UK/British postcodes with easting, northing, latitude, and longitude.
  95. United States Census Bureau.
  96. Web Data Commons, structured data from the Common Crawl, the largest public web corpus.
  97. Webhose free datasets
  98. Wikiposit, a (virtual) amalgamation of (mostly financial) data from many different sites, allowing users to merge data from different sources
  99. Wolfram Alpha disease and patient level data.
  100. Yahoo Sandbox datasets, Language, Graph, Ratings, Advertising and Marketing, Competition
  101. Yelp Academic Dataset, all the data and reviews of the 250 closest businesses for 30 universities for students and academics to explore and research
Sources: 
  • Lionbridge AI
  • Medium, Tomorrow AI
  • KDNuggets Data Set

    Author

    Director
    @lockheedmartin
    | Professor
    @UMDGlobalCampus
    | 1st Cloud Dissertation | Top 5 #Thinkers360 #blockchain #cloud #iot #AI #AIEthics #digital #cyber #5g

