

LinkedIn dove deep into its user data and produced a list of the top 40 U.S. companies for attracting and keeping talented employees.

Only three companies in the top 20 (Coca-Cola, Under Armour, and BlackRock) were not in the tech industry. These days, though, plenty of businesses consider themselves part of that sector. Take the CEO of Goldman Sachs, who refers to the financial firm as a tech company. Furthermore, coding has become the most important job skill across industries.

As the LinkedIn report notes, with every company going through a tech-driven transformation, “being a talent magnet is going to be what separates the winners from the also-rans.”

Google clinched the top spot among the 40 top attractors. With perks like free meals and massages, along with a culture that supports diversity and the creation of the “perfect” team, Google has no trouble drawing candidates. One former Google recruiter estimated that he reviewed 3 million resumes in one year.

But what does it really take to get a job at the top organizations?

Tech knowledge is, of course, a requirement. However, Chris Bolte, the cofounder and CEO of Paysa, a big-data platform providing market insight about compensation and retention, says he’s seeing another trend that is clinching jobs for those without a traditional computer science degree.

“The most explosive growth we’re seeing is the emergence of deep learning,” says Bolte. He explains, “It’s a branch of machine learning and artificial intelligence leveraging what are called neural networks.”

At its simplest, the neural network functions like a web of interconnected brain cells inside a computer that can parse signals from images or video, for example. It can learn to recognize patterns and make decisions in a human way.

Bolte explains that deep learning extends a number of layers “deeper” than what was previously computationally feasible. He says given the amount of data Internet giants create, coupled with the advancement of computing, these deep learning methods are able to model signals more completely.

Artificial intelligence and machine learning as broader skills offer opportunity to a variety of tech talent. In a recent interview with Fast Company, senior machine learning recruiter at Microsoft Amanda Papp revealed: “Not everyone has to have that PhD in computer science. We have folks that have more physics backgrounds or biomathematics or computational biology. [Machine learning] doesn’t necessarily have to have that standard comp-sci path.”

Paysa’s data shows that programming is still in high demand. At Google, for example, nearly half (45%) of its 60,000 employees know Java, and 42% know Python. Only 13% are knowledgeable in Git (a version control system widely used in open source software development), and 14% have skills in cloud computing. The majority (83%) of employees at Google have a bachelor’s degree, but only 7% come from Stanford. Other schools include the Colorado School of Mines, Carnegie Mellon, and University College Dublin.

The No. 2 company on the list, Salesforce, has 20,000 people on staff, but their top skills are somewhat different from those at Google. As many as 46% know cloud computing, and 39% are versed in agile methodologies (project management for software development). Eighty percent hold bachelor’s degrees, and the company draws from schools such as the University of California, Berkeley, Southeast University, Arizona State, and the University of Illinois at Urbana-Champaign.

At Facebook, programming languages are in demand. Paysa data indicates that 46% of Facebook employees know Java and 44% know Python. Other top skills include C++, distributed systems, algorithms, and machine learning. Similar to the other two companies, most employees (84%) have a four-year degree, but 42% of Facebookers hold master’s degrees, too. This lends credence to recent research showing that more employers are looking to hire candidates with advanced degrees.

However, Apple isn’t pushing for higher education across its entire workforce. Only 71% of its 100,000 employees earned bachelor’s degrees, and 28% report not having a degree at all. This is due, in part, to the fact that not every employee is working on development at its Cupertino headquarters. This can also be seen in the most prevalent skill among Apple’s ranks: software development (28%), with Java coming in second at 27%.

Rounding out the LinkedIn top five is Amazon. The e-commerce juggernaut draws many of its staffers from nearby University of Washington in Seattle, and 83% of its employees hold bachelor’s degrees. More than half (57%) know Java, and 45% are skilled in software development. Surprisingly for the company that powers so many websites with AWS, only about a fifth (21%) are knowledgeable in web services.

Despite the focus on technical prowess, those in charge of hiring at both Facebook and Google are looking for more than just hard skills.

In a previous interview, Laszlo Bock, Google’s head of people operations, told Fast Company that Alphabet is looking for some very specific traits.

Four things:

  • General cognitive ability . . . Not just raw [intelligence] but the ability to absorb information.
  • Emergent leadership: The idea there being that when you see a problem, you step in and try to address it. Then you step out when you’re no longer needed. That willingness to give up power is really important.
  • Cultural fit: We call it Googleyness, but it boils down to intellectual humility. You don’t have to be warm or fuzzy. You just have to be somebody who, when the facts show you’re wrong, can say that.
  • Expertise in the job we’re gonna hire you for.

This is particularly important for candidates new to the job market. Hiring managers are looking for applicants who have developed soft skills such as communication, teamwork, and leadership, according to a survey by PayScale. As many as 60% of employers found critical thinking and problem solving lacking in entry-level job seekers.

Dan Schawbel, research director at Future Workplace, a cosponsor of the PayScale study, told Fast Company in a previous interview, “No working day will be complete without writing an email or tackling a new challenge, so the sooner you develop these skills, the more employable you will become.”

This is the fifth post in a series of posts on how to build a Data Science Portfolio. You can find links to the others in this series at the bottom of the post.

If you've ever worked on a personal data science project, you've probably spent a lot of time browsing the internet looking for interesting data sets to analyze. It can be fun to sift through dozens of data sets to find the perfect one, but it can also be frustrating to download and import several csv files, only to realize that the data isn't that interesting after all. Luckily, there are online repositories that curate data sets and (mostly) remove the uninteresting ones.

In this post, we'll walk through several types of data science projects, including data visualization projects, data cleaning projects, and machine learning projects, and identify good places to find data sets for each. Whether you want to strengthen your data science portfolio by showing that you can visualize data well, or you have a spare few hours and want to practice your machine learning skills, we've got you covered.

Data sets for Data Visualization Projects

A typical data visualization project might be something along the lines of "I want to make an infographic about how income varies across the different states in the US". There are a few considerations to keep in mind when looking for a good data set for a data visualization project:

  • It shouldn't be messy, because you don't want to spend a lot of time cleaning data.
  • It should be nuanced and interesting enough to make charts about.
  • Ideally, each column should be well-explained, so the visualization is accurate.
  • The data set shouldn't have too many rows or columns, so it's easy to work with.

A good place to find data sets for data visualization projects is news sites that release their data publicly. They typically clean the data for you, and they already have charts you can replicate or improve.

1. FiveThirtyEight

FiveThirtyEight is an incredibly popular interactive news and sports site started by Nate Silver. They write interesting data-driven articles, like "Don't blame a skills gap for lack of hiring in manufacturing" and "2016 NFL Predictions".

FiveThirtyEight makes the data sets used in its articles available online on Github.

View the FiveThirtyEight Data sets


2. BuzzFeed

BuzzFeed started as a purveyor of low-quality articles, but has since evolved and now writes some investigative pieces, like "The court that rules the world" and "The short life of Deonte Hoard".

BuzzFeed makes the data sets used in its articles available on Github.

View the BuzzFeed Data sets


3. Socrata OpenData

Socrata OpenData is a portal that contains multiple clean data sets that can be explored in the browser or downloaded to visualize. A significant portion of the data is from US government sources, and many are outdated.

You can explore and download data from OpenData without registering. You can also use visualization and exploration tools to explore the data in the browser.

View Socrata OpenData


Data sets for Data Processing Projects

Sometimes you just want to work with a large data set. The end result doesn't matter as much as the process of reading in and analyzing the data. You might use tools like Spark or Hadoop to distribute the processing across multiple nodes. Things to keep in mind when looking for a good data processing data set:

  • The cleaner the data, the better -- cleaning a large data set can be very time consuming.
  • The data set should be interesting.
  • There should be an interesting question that can be answered with the data.
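Before reaching for Spark or Hadoop, the read-and-aggregate pattern those tools distribute across nodes can be sketched locally with the standard library. The tiny in-memory CSV below is a stand-in for a file too large to load at once:

```python
# Streaming aggregation over a CSV, one row at a time, so memory stays flat
# no matter how large the file is -- the same pattern Spark distributes.
import csv
import io
from collections import defaultdict

def aggregate_by_key(lines, key_col, value_col):
    """Sum value_col grouped by key_col, reading one row at a time."""
    totals = defaultdict(float)
    for row in csv.DictReader(lines):
        totals[row[key_col]] += float(row[value_col])
    return dict(totals)

# Hypothetical sample standing in for a multi-gigabyte file on disk.
sample = io.StringIO("state,income\nWA,50000\nCA,60000\nWA,55000\n")
print(aggregate_by_key(sample, "state", "income"))
# → {'WA': 105000.0, 'CA': 60000.0}
```

On a real project you would pass an open file handle (or a Spark RDD/DataFrame) instead of the StringIO buffer.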

A good place to find large public data sets is cloud hosting providers like Amazon and Google. They have an incentive to host data sets, because doing so encourages you to analyze them on their infrastructure (and pay for the computing).

4. AWS Public Data sets

Amazon makes large data sets available on its Amazon Web Services platform. You can download the data and work with it on your own computer, or analyze the data in the cloud using EC2 and Hadoop via EMR. You can read more about how the program works here.

Amazon has a page that lists all of the data sets for you to browse. You'll need an AWS account, although Amazon gives you a free access tier for new accounts that will enable you to explore the data without being charged.

View AWS Public Data sets


5. Google Public Data sets

Much like Amazon, Google also has a cloud hosting service, called Google Cloud Platform. With GCP, you can use a tool called BigQuery to explore large data sets.

Google lists all of the data sets on a page. You'll need to sign up for a GCP account, but the first 1TB of queries you make are free.

View Google Public Data sets

Here are some examples:

  • USA Names — contains all Social Security name applications in the US, from 1879 to 2015.
  • Github Activity — contains all public activity on over 2.8 million public Github repositories.
  • Historical Weather — data from 9000 NOAA weather stations from 1929 to 2016.
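As a hedged sketch of what querying one of these sets looks like, the snippet below targets the USA Names set; it assumes the google-cloud-bigquery package and a configured GCP account, so only the query string itself is verifiable offline:

```python
# Sketch: the five most common given names in the USA Names public data set.
# Assumes: pip install google-cloud-bigquery, plus GCP credentials
# (GOOGLE_APPLICATION_CREDENTIALS) -- run_query only works once those exist.
query = """
SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY name
ORDER BY total DESC
LIMIT 5
"""

def run_query(sql):
    from google.cloud import bigquery
    client = bigquery.Client()  # picks up credentials from the environment
    return [dict(row) for row in client.query(sql).result()]

# run_query(query) would return five rows of {name, total}.
```

The first 1TB of query processing falls under the free tier mentioned above, so exploratory queries like this one typically cost nothing.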

6. Wikipedia

Wikipedia is a free, online, community-edited encyclopedia. Wikipedia contains an astonishing breadth of knowledge, containing pages on everything from the Ottoman-Habsburg Wars to Leonard Nimoy. As part of Wikipedia's commitment to advancing knowledge, they offer all of their content for free, and regularly generate dumps of all the articles on the site. Additionally, Wikipedia offers edit history and activity, so you can track how a page on a topic evolves over time, and who contributes to it.

You can find the various ways to download the data on the Wikipedia site. You'll also find scripts to reformat the data in various ways.

View Wikipedia Data sets
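Here is a minimal sketch of reading a dump with the standard library's streaming XML parser. The inline string stands in for a real pages-articles dump, which also carries XML namespaces and far more metadata per page:

```python
# Stream page titles out of a (toy) Wikipedia-style XML dump. iterparse
# keeps memory flat, which matters on real multi-gigabyte dump files.
import io
import xml.etree.ElementTree as ET

sample_dump = io.BytesIO(b"""<mediawiki>
  <page><title>Leonard Nimoy</title><revision><text>Actor ...</text></revision></page>
  <page><title>Ottoman-Habsburg wars</title><revision><text>A series ...</text></revision></page>
</mediawiki>""")

titles = []
for event, elem in ET.iterparse(sample_dump, events=("end",)):
    if elem.tag == "page":
        titles.append(elem.findtext("title"))
        elem.clear()  # free the finished element's memory before moving on
print(titles)
# → ['Leonard Nimoy', 'Ottoman-Habsburg wars']
```

For a real dump you would open the downloaded (and usually bz2-compressed) file instead of the BytesIO buffer, and account for the namespace prefix on the tags.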


Data sets for Machine Learning Projects

When you're working on a machine learning project, you want to be able to predict a column from the other columns in a data set. To do this, you need to make sure that:

  • The data set isn't too messy -- if it is, we'll spend all of our time cleaning the data.
  • There's an interesting target column to make predictions for.
  • The other variables have some explanatory power for the target column.

There are a few online repositories of data sets that are specifically for machine learning. These data sets are typically cleaned up beforehand and let you test algorithms quickly.

7. Kaggle

Kaggle is a data science community that hosts machine learning competitions. There are a variety of externally-contributed interesting data sets on the site. Kaggle has both live and historical competitions. You can download data for either, but you have to sign up for Kaggle and accept the terms of service for the competition.

You can download data from Kaggle by entering a competition. Each competition has its own associated data set. There are also user-contributed data sets found in the new Kaggle Data sets offering.

View Kaggle Data sets
View Kaggle Competitions

Here are some examples:

  • Satellite Photograph Order — a data set of satellite photos of Earth — the goal is to predict which photos were taken earlier than others.
  • Manufacturing Process Failures — a data set of variables that were measured during the manufacturing process. The goal is to predict faults with the manufacturing.
  • Multiple Choice Questions — a data set of multiple choice questions and the corresponding correct answers. The goal is to predict the answer for any given question.

8. UCI Machine Learning Repository

The UCI Machine Learning Repository is one of the oldest sources of data sets on the web. Although the data sets are user-contributed, and thus have varying levels of documentation and cleanliness, the vast majority are clean and ready for machine learning to be applied. UCI is a great first stop when looking for interesting data sets.

You can download data directly from the UCI Machine Learning repository, without registration. These data sets tend to be fairly small, and don't have a lot of nuance, but are good for machine learning.

View UCI Machine Learning Repository

Here are some examples:

  • Email spam — contains emails, along with a label of whether or not they're spam.
  • Wine classification — contains various attributes of 178 different wines.
  • Solar flares — attributes of solar flares, useful for predicting characteristics of flares.
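As a minimal sketch of the predict-a-column workflow, scikit-learn ships a copy of the UCI wine data set (178 wines, 13 chemical attributes, 3 cultivars), so the whole loop runs offline, assuming scikit-learn is installed:

```python
# Predict the cultivar (target column) from the wines' chemical attributes.
# Assumes: pip install scikit-learn
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)  # 178 rows, 13 features, 3 classes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=5000)  # high max_iter so the solver converges
model.fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
```

Any UCI data set downloaded as a CSV slots into the same pattern: load it, pick the target column, split, fit, score.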

9. Quandl

Quandl is a repository of economic and financial data. Some of this information is free, but many data sets require purchase. Quandl is useful for building models to predict economic indicators or stock prices. Due to the large amount of available data sets, it's possible to build a complex model that uses many data sets to predict values in another.

View Quandl Data sets.


Data sets for Data Cleaning Projects

Sometimes, it can be very satisfying to take a data set spread across multiple files, clean them up, condense them into one, and then do some analysis. In data cleaning projects, sometimes it takes hours of research to figure out what each column in the data set means. It may sometimes turn out that the data set you're analyzing isn't really suitable for what you're trying to do, and you'll need to start over.

When looking for a good data set for a data cleaning project, you want it to:

  • Be spread over multiple files.
  • Have a lot of nuance, and many possible angles to take.
  • Require a good amount of research to understand.
  • Be as "real-world" as possible.

These types of data sets are typically found on aggregators of data sets. These aggregators tend to have data sets from multiple sources, without much curation. Too much curation gives us overly neat data sets that are hard to do extensive cleaning on.
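The condense-multiple-files step can be sketched in a few lines of pandas. The StringIO buffers and yearly file names below are hypothetical stand-ins for real downloaded files:

```python
# Combine a data set spread across yearly files into one frame, recovering
# the year from each (hypothetical) filename along the way.
import io
import pandas as pd

files = {
    "2015.csv": io.StringIO("city,income\nSeattle,60000\nAustin,52000\n"),
    "2016.csv": io.StringIO("city,income\nSeattle,62000\nAustin,54000\n"),
}

frames = []
for name, f in files.items():
    df = pd.read_csv(f)
    df["year"] = int(name.split(".")[0])  # the filename is the only record of the year
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
print(combined.groupby("city")["income"].mean())  # Seattle 61000, Austin 53000
```

With real files you would glob a directory and pass paths to read_csv; the research-heavy part is usually deciding which columns mean the same thing across files.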

10. data.world

data.world describes itself as 'the social network for data people', but could be more accurately described as 'GitHub for data'. It's a place where you can search for, copy, analyze, and download data sets. In addition, you can upload your data to data.world and use it to collaborate with others.

In a relatively short time it has become one of the 'go to' places to acquire data, with lots of user-contributed data sets as well as fantastic data sets through data.world's partnerships with various organizations, including a large amount of data from the US Federal Government.

One key differentiator of data.world is the tools they have built to make working with data easier: you can write SQL queries within their interface to explore data and join multiple data sets. They also have SDKs for R and Python to make it easier to acquire and work with data in your tool of choice. (You might be interested in reading our tutorial on the Python SDK.)

View data.world Data sets

11. Data.gov

Data.gov is a relatively new site that's part of a US effort toward open government. It makes it possible to download data from multiple US government agencies. Data can range from government budgets to school performance scores. Much of the data requires additional research, and it can sometimes be hard to figure out which data set is the "correct" version. Anyone can download the data, although some data sets require you to jump through additional hoops, like agreeing to licensing agreements.

You can browse the data sets on Data.gov directly, without registering. You can browse by topic area, or search for a specific data set.

View Data.gov Data sets


12. The World Bank

The World Bank is a global development organization that offers loans and advice to developing countries. The World Bank regularly funds programs in developing countries, then gathers data to monitor the success of these programs.

You can browse World Bank data sets directly, without registering. The data sets have many missing values, and sometimes take several clicks to actually get to data.

View World Bank Data sets


13. /r/datasets

Reddit, a popular community discussion site, has a section devoted to sharing interesting data sets. It's called the datasets subreddit, or /r/datasets. The scope of these data sets varies a lot, since they're all user-submitted, but they tend to be very interesting and nuanced.

You can browse the subreddit here. You can also see the most highly upvoted data sets here.

View Top /r/datasets Posts


14. Academic Torrents

Academic Torrents is a newer site geared toward sharing the data sets from scientific papers. It's hard to tell yet what the most common types of data sets will look like, but for now it has tons of interesting data sets that lack context.

You can browse the data sets directly on the site. Since it's a torrent site, all of the data sets can be immediately downloaded, but you'll need a BitTorrent client. Deluge is a good free option.

View Academic Torrents Data sets

Here are some examples:

  • Enron emails — a set of many emails from executives at Enron, a company that famously went bankrupt.
  • Student learning factors — a set of factors that measure and influence student learning.
  • News articles — contains news article attributes and a target variable.

Bonus: Streaming data

It's very common when you're building a data science project to download a data set and then process it. However, as online services generate more and more data, an increasing amount is generated in real-time, and not available in data set form. Some examples of this include data on tweets from Twitter, and stock price data. There aren't many good sources to acquire this kind of data, but we'll list a few in case you want to try your hand at a streaming data project.

15. Twitter

Twitter has a good streaming API, and makes it relatively straightforward to filter and stream tweets. You can get started here. There are tons of options -- you could figure out which states are the happiest, or which countries use the most complex language. We also recently wrote an article to get you started with the Twitter API.

Get started with the Twitter API

16. Github

Github has an API that allows you to access repository activity and code. You can get started with the API here. The options are endless -- you could build a system to automatically score code quality, or figure out how code evolves over time in large projects.

Get started with the Github API
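As a hedged sketch, the public /repos/{owner}/{repo}/commits REST endpoint can be queried with just the standard library; unauthenticated requests work for light use, and passing a token raises the rate limit:

```python
# Fetch recent commit metadata for a public repository from the Github API.
import json
from urllib.request import Request, urlopen

def commits_url(owner, repo):
    """URL of the public commits endpoint for a repository."""
    return f"https://api.github.com/repos/{owner}/{repo}/commits"

def recent_commits(owner, repo, token=None):
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    with urlopen(Request(commits_url(owner, repo), headers=headers)) as resp:
        return json.load(resp)

# Example (requires network access):
# for c in recent_commits("fivethirtyeight", "data")[:3]:
#     print(c["commit"]["message"])
```

Paginating through the results over time is one way to build the "how does code evolve" data set the paragraph above describes.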

17. Quantopian

Quantopian is a site where you can develop, test, and operationalize stock trading algorithms. To help you do that, they give you access to free minute-by-minute stock price data. You could build a stock price prediction algorithm.

Get started with Quantopian

18. Wunderground

Wunderground has an API for weather forecasts that is free for up to 500 API calls per day. You could use these calls to build up a set of historical weather data, and make predictions about the weather tomorrow.

Get started with the Wunderground API

Next steps

In this post, we covered good places to find data sets for any type of data science project. We hope that you find something interesting that you want to sink your teeth into!

If you do end up building a project, we'd love to hear about it. Please let us know!

At Dataquest, our interactive guided projects are designed to help you start building a data science portfolio to demonstrate your skills to employers and get a job in data. If you're interested, you can sign up and do our first module for free.


If you liked this, you might like to read the other posts in our 'Build a Data Science Portfolio' series: