Data providers
As data scientists, we need data to do “science” on. If we want to analyze a country’s housing market, we need to know each region’s current population, population growth, available housing, and likely many more statistics. If we want to study short-term trends in stock price data, we want a data source that provides us with prices at the smallest time interval we can get. If we want to create a machine-learning model to provide us with automated marketing insights, we need data to train it on and then data to analyze.
I was looking for data sources to test database capabilities and data analysis methods. I ended up with some general groups of sites where we can find data: governments and intergovernmental organizations, universities and research institutes, data vendors, and others.
Governments and intergovernmental organizations
Many governments and intergovernmental organizations collect and process statistics that they share freely. How much data is available depends heavily on the country, but what is published is usually of good quality. The United States is likely the country with the most data.
Some examples:
- https://fred.stlouisfed.org/ shares up-to-date economic data on the United States, such as unemployment rates, inflation rates, and the crowd favorite, the consumer price index.
- https://data.europa.eu/en has a huge number of freely available data sets, nearing 2 million at the time of writing. Currently, they’re promoting marine data, and there is also a data set on the individuals and groups sanctioned by the European Union. These data sets look more like one-off analyses than like sources that are kept up to date.
- https://www.cbs.nl/en-gb/ has many data sets on the Netherlands concerning topics such as leisure and culture, enterprises, and basic economic data.
Universities and research institutes
Universities and research institutes share data for transparency and reproducibility purposes. For data science, it’s useful to check and practice methods on the data sets provided with the papers that introduce them. I’m not sure I would suggest searching through these open databases, as they are immense.
Two examples:
- https://zenodo.org/ has 4.6 million data sets. I find it quite hard to navigate and prefer to just use it when a paper I’m reading references it.
- https://dataverse.harvard.edu/ has 190,405 data sets at the time of writing. Way fewer than Zenodo, but better organized.
Data vendors
There is a huge business in collecting, consolidating, and selling data. The businesses that partake in this endeavor are data vendors. What the value of data is makes for an interesting question; I guess it is some combination of how useful the data is and how hard it is to acquire.
Two examples:
- https://tradingeconomics.com/ provides economic indicators for a reasonable price. I used it once to get price data to support a research paper. Interesting if you want to play around with fitting pricing models.
- https://openweathermap.org/ provides weather data. They have a free and a paid offering. There are a bunch of other providers, and it’s unclear to me how they compare.
Others
Some other places to retrieve data:
- https://www.alphavantage.co/ provides historical and real-time stock market data. I’m planning to use this to practice building an updating pipeline.
- https://www.mediawiki.org/wiki/API:Main_page shows the Wikipedia endpoints. They give access to Wikipedia’s pages, which offer a wealth of information to do text analysis on.
- https://wiki.openstreetmap.org/wiki/API gives access to OpenStreetMap, a map database for all your favourite route planning and traveling salesman exercises.
- https://www.bfro.net/gdb/ collects, last but not least, Bigfoot’s (or Bigfoots’?) sightings. Can you determine where he will appear next based on the data, so he can be captured on camera?
- Of course, there are many more publishers of datasets. If you’re looking for something in particular, have a look at https://datasetsearch.research.google.com/.
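To give an idea of what using one of the API-based sources above can look like, here is a minimal Python sketch for the Wikipedia endpoints. It rests on a few assumptions that are mine and not from the sources themselves: it targets the English Wikipedia at https://en.wikipedia.org/w/api.php, it relies on the `extracts` property of the Action API to get plain text, and the helper names (`build_extract_url`, `parse_extracts`, `fetch_extract`) and the User-Agent string are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: English Wikipedia; other language editions use a different host.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title: str) -> str:
    """Build an Action API URL that asks for the plain-text extract of one page."""
    params = {
        "action": "query",
        "prop": "extracts",   # plain-text extracts, as offered on Wikipedia
        "explaintext": 1,      # return text instead of HTML
        "titles": title,
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


def parse_extracts(response: dict) -> dict:
    """Map page title -> extract text for every page in a query response."""
    pages = response.get("query", {}).get("pages", {})
    # Pages that do not exist carry no "extract" key, so fall back to "".
    return {page["title"]: page.get("extract", "") for page in pages.values()}


def fetch_extract(title: str) -> dict:
    """Download and parse the extract for one page (needs network access)."""
    request = Request(
        build_extract_url(title),
        # Placeholder User-Agent; replace it with something that identifies you.
        headers={"User-Agent": "data-providers-example/0.1 (placeholder contact)"},
    )
    with urlopen(request, timeout=10) as reply:
        return parse_extracts(json.load(reply))
```

Calling `fetch_extract("Travelling salesman problem")` should then return a dictionary with the page title as key and the article text as value, ready for text analysis. Splitting URL building and parsing from the download keeps the first two testable without a network connection.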