Data Mining Project Ideas
1. DBLP Research publication database
This data set consists of computer science bibliography data. An interesting browser for viewing this dataset is available too.
Project ideas:
Project C1: clustering evolution. Many social network studies try to identify the community structure in the relations among people. The most common approach is to cluster people based on their interactions. An interesting study would be to model how the clusters change over time.
Project C2: distance measure study. A good distance function is crucial to the success of any learning algorithm. This is especially true for a heterogeneous dataset like this one, where a naive distance function such as Euclidean distance is undefined.
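As a starting point for both projects, here is a minimal sketch of one possible set-based distance. It assumes each paper is reduced to a list of author names (the toy names and the helper functions below are hypothetical, not part of the DBLP data format). Jaccard distance is only one candidate; it is used here both to compare two authors by their co-author sets and to match a cluster in one year to its closest counterpart in a later year.

```python
from collections import defaultdict

def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two sets (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def coauthor_sets(papers):
    """Map each author to the set of people they have published with."""
    neighbours = defaultdict(set)
    for authors in papers:
        for author in authors:
            neighbours[author].update(x for x in authors if x != author)
    return neighbours

def closest_cluster(cluster, later_clusters):
    """Index of the later cluster with the smallest Jaccard distance."""
    return min(range(len(later_clusters)),
               key=lambda i: jaccard_distance(cluster, later_clusters[i]))

# Toy data: each paper is a list of made-up author names.
papers = [["ann", "bob", "cy"], ["ann", "bob", "dee"], ["eve", "fay"]]
neighbours = coauthor_sets(papers)
d_ann_bob = jaccard_distance(neighbours["ann"], neighbours["bob"])
d_ann_eve = jaccard_distance(neighbours["ann"], neighbours["eve"])

# Toy cluster snapshots for two consecutive years (also made up).
clusters_2004 = [{"ann", "bob"}, {"eve", "fay"}]
clusters_2005 = [{"eve", "fay", "gus"}, {"ann", "bob", "cy"}]
match = closest_cluster(clusters_2004[0], clusters_2005)
```

Tracking which cluster each earlier cluster maps to, year by year, gives a simple view of how communities grow, split, or merge.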
2. H-1B Visa Workers
The H-1B is an employment-based, non-immigrant visa category for temporary foreign workers in the United States. For a foreign national to apply for an H-1B visa, a US employer must offer a job and petition for the H-1B visa with the US immigration department. This is the most common visa status applied for and held by international students once they complete college or higher education (Masters, PhD) and work in a full-time position. The Office of Foreign Labor Certification (OFLC) publishes program data with useful information about the immigration programs, including the H-1B visa. The disclosure data, updated annually, is available at this link.
However, the raw data is messy and might not be suitable for rapid analysis. A set of data transformations was performed to make the data more accessible for quick exploration.
Inspiration: Is the number of petitions with the Data Engineer job title increasing over time? Which part of the US has the most Hardware Engineer jobs? Which industry has the largest number of Data Scientist positions? Which employers file the most petitions each year?
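The first and last questions reduce to grouped counts. The sketch below assumes each petition has been loaded as a dictionary with the hypothetical keys `year`, `job_title`, and `employer`; the actual column names in the disclosure files may differ, and the sample rows are invented.

```python
from collections import Counter

def petitions_per_year(rows, job_title):
    """Count petitions per year for one job title (case-insensitive)."""
    counts = Counter()
    for row in rows:
        if row["job_title"].strip().lower() == job_title.lower():
            counts[row["year"]] += 1
    return dict(sorted(counts.items()))

def top_employers(rows, year, n=3):
    """Return the n employers with the most petitions in a given year."""
    counts = Counter(row["employer"] for row in rows if row["year"] == year)
    return counts.most_common(n)

# Invented sample rows; key names are assumptions, not the real schema.
rows = [
    {"year": 2015, "job_title": "Data Engineer", "employer": "Acme"},
    {"year": 2016, "job_title": "DATA ENGINEER", "employer": "Acme"},
    {"year": 2016, "job_title": "Data Engineer ", "employer": "Globex"},
    {"year": 2016, "job_title": "Hardware Engineer", "employer": "Acme"},
]

trend = petitions_per_year(rows, "data engineer")
leaders = top_employers(rows, 2016, n=1)
```

Normalizing case and whitespace before counting matters here because the raw job titles are entered as free text.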
3. HappyDB
HappyDB is a corpus of more than 100,000 happy moments crowd-sourced via Amazon's Mechanical Turk.
Each worker is given the following task: What made you happy today? Reflect on the past 24 hours, and recall three actual events that happened to you that made you happy. Write down your happy moment in a complete sentence. (Write three such moments.) The goal of the corpus is to advance the understanding of the causes of happiness through text-based reflection. More information is available on the HappyDB website.
Inspiration: Here are a few sample exploration questions.
- What are the popular sports/movies/books/purchased products/tourist destinations/... that make people happy?
- Can we predict gender/marital status/parenthood/age group based on happy moment texts?
- How many indoor and outdoor activities are in the corpus respectively?
- Can we find interesting ways of clustering happy moments?
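For the indoor/outdoor question, a keyword-based tagger is one very rough first pass. The two word lists below are hypothetical placeholders, and the sample moments are invented; a real study would need a much richer vocabulary or a trained classifier.

```python
import re
from collections import Counter

# Hypothetical keyword lists, chosen only for illustration.
INDOOR = {"movie", "book", "cooked", "couch", "tv"}
OUTDOOR = {"park", "hike", "beach", "garden", "run"}

def tag_moment(text):
    """Tag a happy moment as 'indoor', 'outdoor', or 'unknown'."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    indoor = bool(words & INDOOR)
    outdoor = bool(words & OUTDOOR)
    if indoor and not outdoor:
        return "indoor"
    if outdoor and not indoor:
        return "outdoor"
    return "unknown"  # neither list matched, or both did

# Invented sample moments.
moments = [
    "I watched a movie with my sister.",
    "I went for a hike in the park.",
    "My boss praised my work.",
]

tally = Counter(tag_moment(m) for m in moments)
```

Exact word matching misses inflected forms such as "running", so stemming or lemmatization would be a natural next step.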
3. Recruit Restaurant Visitor Forecasting (Open Competition Project)
Predict how many future visitors a restaurant will receive.
Running a thriving local restaurant isn't always as charming as first impressions appear. There are often all sorts of unexpected troubles popping up that could hurt business.
One common predicament is that restaurants need to know how many customers to expect each day to effectively purchase ingredients and schedule staff members. This forecast isn't easy to make because many unpredictable factors affect restaurant attendance, like weather and local competition. It's even harder for newer restaurants with little historical data.
Recruit Holdings has unique access to key datasets that could make automated future customer prediction possible. Specifically, Recruit Holdings owns Hot Pepper Gourmet (a restaurant review service), AirREGI (a restaurant point of sales service), and Restaurant Board (reservation log management software).
In this competition, you're challenged to use reservation and visitation data to predict the total number of visitors to a restaurant for future dates. This information will help restaurants be much more efficient and allow them to focus on creating an enjoyable dining experience for their customers.
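Before trying any learned model, it helps to have a simple baseline to beat. The sketch below predicts the median of past visits for the same restaurant and day of the week. The `(store_id, date, visitors)` tuple layout and the sample values are assumptions for illustration, not the competition's file format.

```python
from collections import defaultdict
from datetime import date
from statistics import median

def fit_weekday_medians(history):
    """history: iterable of (store_id, date, visitors) tuples."""
    buckets = defaultdict(list)
    for store_id, day, visitors in history:
        buckets[(store_id, day.weekday())].append(visitors)
    return {key: median(values) for key, values in buckets.items()}

def predict(medians, store_id, day, default=0.0):
    """Look up the weekday median; fall back to `default` if unseen."""
    return medians.get((store_id, day.weekday()), default)

# Invented visit history for one hypothetical restaurant.
history = [
    ("store_a", date(2017, 1, 6), 30),   # Friday
    ("store_a", date(2017, 1, 13), 40),  # Friday
    ("store_a", date(2017, 1, 20), 50),  # Friday
    ("store_a", date(2017, 1, 9), 10),   # Monday
]

medians = fit_weekday_medians(history)
friday_forecast = predict(medians, "store_a", date(2017, 1, 27))
monday_forecast = predict(medians, "store_a", date(2017, 1, 30))
```

The `default` fallback is where newer restaurants with little history show up; replacing it with an average over similar restaurants is one possible refinement.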
4. Toxic Comment Classification (Open Competition Project)
Identify and classify toxic online comments
Discussing things you care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments.
The Conversation AI team, a research initiative founded by Jigsaw and Google (both a part of Alphabet), is working on tools to help improve online conversation. One area of focus is the study of negative online behaviors, like toxic comments (i.e. comments that are rude, disrespectful or otherwise likely to make someone leave a discussion). So far they've built a range of publicly available models served through the Perspective API, including toxicity. But the current models still make errors, and they don't allow users to select which types of toxicity they're interested in finding (e.g. some platforms may be fine with profanity, but not with other types of toxic content).
In this competition, you're challenged to build a multi-headed model that's capable of detecting different types of toxicity like threats, obscenity, insults, and identity-based hate better than Perspective's current models. You'll be using a dataset of comments from Wikipedia's talk page edits. Improvements to the current model will hopefully help online discussion become more productive and respectful.
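To illustrate why separate outputs per toxicity type are useful, the sketch below assumes a model has already produced one score between 0 and 1 for each type. The label identifiers, scores, and thresholds are hypothetical; the point is only that each platform can then apply its own per-type policy.

```python
# Hypothetical label names, based on the toxicity types listed above.
LABELS = ("threat", "obscenity", "insult", "identity_hate")

def flagged_labels(scores, thresholds):
    """Return the labels whose score meets the platform's threshold.

    Labels missing from `thresholds` are never flagged, which lets a
    platform opt out of a toxicity type entirely.
    """
    return [label for label in LABELS
            if label in thresholds
            and scores.get(label, 0.0) >= thresholds[label]]

# Invented model output for one comment.
scores = {"threat": 0.05, "obscenity": 0.92,
          "insult": 0.40, "identity_hate": 0.01}

# One platform flags everything above 0.3; another tolerates profanity.
strict = {label: 0.3 for label in LABELS}
lenient_on_profanity = {"threat": 0.3, "insult": 0.3, "identity_hate": 0.3}

strict_flags = flagged_labels(scores, strict)
lenient_flags = flagged_labels(scores, lenient_on_profanity)
```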
Disclaimer: the dataset for this competition contains text that may be considered profane, vulgar, or offensive.
5. How ISIS Uses Twitter
Analyze how ISIS fanboys have been using Twitter since the 2015 Paris attacks
We scraped over 17,000 tweets from 100+ pro-ISIS fanboys from all over the world since the November 2015 Paris Attacks. We are working with content producers and influencers to develop effective counter-messaging measures against violent extremists at home and abroad. In order to maximize our impact, we need assistance in quickly analyzing message frames.
The dataset includes the following:
1. Name
2. Username
3. Description
4. Location
5. Number of followers at the time the tweet was downloaded
6. Number of statuses by the user when the tweet was downloaded
7. Date and timestamp of the tweet
8. The tweet itself
Based on this data, here are some useful ways of deriving insights and analysis:
- Social Network Cluster Analysis: Who are the major players in the pro-ISIS Twitter network? Ideally, we would like this visualized via a cluster network with the biggest influencers scaled larger than smaller influencers.
- Keyword Analysis: Which keywords derived from the name, username, description, location, and tweets were the most commonly used by ISIS fanboys? Examples include: "baqiyah", "dabiq", "wilayat", "amaq"
- Data Categorization of Links: Which websites are pro-ISIS fanboys linking to? Categories include: Mainstream Media, Altermedia, Jihadist Websites, Image Upload, Video Upload.
- Sentiment Analysis: Which clergy do pro-ISIS fanboys quote the most and which ones do they hate the most? Search the tweets for names of prominent clergy and classify the tweet as positive, negative, or neutral and if negative, include the reasons why. Examples of clergy they like the most: "Anwar Awlaki", "Ahmad Jibril", "Ibn Taymiyyah", "Abdul Wahhab". Examples of clergy that they hate the most: "Hamza Yusuf", "Suhaib Webb", "Yaser Qadhi", "Nouman Ali Khan", "Yaqoubi".
- Timeline View: Visualize all the tweets over a timeline and identify peak moments.
Further Reading: "ISIS Has a Twitter Strategy and It is Terrifying [Infographic]"
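For the network analysis, one simple proxy for influence is how many distinct tweets from other accounts mention a given user. The sketch below assumes tweets are available as `(username, text)` pairs and that mentions follow the usual `@name` form; the sample handles are placeholders. The resulting counts could be used to scale node sizes in a network plot.

```python
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")

def mention_counts(tweets):
    """Count how often each account is mentioned by other accounts.

    tweets: iterable of (username, text) pairs. Repeated mentions
    inside one tweet count once, and self-mentions are ignored.
    """
    counts = Counter()
    for username, text in tweets:
        targets = {m.lower() for m in MENTION.findall(text)}
        for target in targets:
            if target != username.lower():
                counts[target] += 1
    return counts

# Placeholder tweets with made-up handles.
tweets = [
    ("user_a", "RT @user_c: some text"),
    ("user_b", "@user_c @user_a reply text"),
    ("user_c", "note to self @user_c"),
]

counts = mention_counts(tweets)
most_mentioned = counts.most_common(1)
```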
About Fifth Tribe
Fifth Tribe is a digital agency based out of DC that serves businesses, non-profits, and government agencies. We provide our clients with product development, branding, web/mobile development, and digital marketing services. Our client list includes Oxfam, Ernst and Young, Kaiser Permanente, Aetna Innovation Health, the U.S. Air Force, and the U.S. Peace Corps. Along with Goldman Sachs International and IBM, we serve on the Private Sector Committee of the Board of the Global Community Engagement and Resilience Fund (GCERF), the first global effort to support local, community-level initiatives aimed at strengthening resilience against violent extremism. In December 2014, we won the anti-ISIS "Hedaya Hack" organized by Affinis Labs and hosted at the "Global Countering Violent Extremism (CVE) Expo" in Abu Dhabi. Since then, we've been actively involved in working with the open-source community and community content producers in developing counter-messaging campaigns and tools.
6. Exercise Pattern Prediction
Which category does your exercise pattern fall into?
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. Our goal here will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
Inspiration: What are better ways of cleaning up the data? Which model fits it best?
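One common cleaning step for sensor data of this kind is to drop columns that are mostly empty before fitting any model. The sketch below assumes rows are dictionaries and that missing readings appear as an empty string, "NA", or None; the column names and values are invented for illustration.

```python
def drop_sparse_columns(rows, max_missing=0.5, missing=("", "NA", None)):
    """Drop columns whose fraction of missing values exceeds max_missing.

    rows: list of dicts sharing the same keys.
    """
    if not rows:
        return []
    keep = [
        column for column in rows[0]
        if sum(r.get(column) in missing for r in rows) / len(rows) <= max_missing
    ]
    return [{column: r.get(column) for column in keep} for r in rows]

# Invented readings; column names are hypothetical.
rows = [
    {"accel_belt_x": 1.2, "kurtosis_belt": "NA"},
    {"accel_belt_x": 0.8, "kurtosis_belt": ""},
    {"accel_belt_x": 1.0, "kurtosis_belt": 3.1},
]

cleaned = drop_sparse_columns(rows)
```

The 0.5 cutoff is arbitrary; comparing model accuracy under a few different cutoffs is one way to answer the cleaning question empirically.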
7. Data Science Bowl
Can you improve lung cancer detection?
In the United States, lung cancer strikes 225,000 people every year, and accounts for $12 billion in health care costs. Early detection is critical to give patients the best chance at recovery and survival.
One year ago, the office of the U.S. Vice President spearheaded a bold new initiative, the Cancer Moonshot, to make a decade's worth of progress in cancer prevention, diagnosis, and treatment in just 5 years.
In 2017, the Data Science Bowl will be a critical milestone in support of the Cancer Moonshot by convening the data science and medical communities to develop lung cancer detection algorithms.
Using a data set of thousands of high-resolution lung scans provided by the National Cancer Institute, participants will develop algorithms that accurately determine when lesions in the lungs are cancerous. This will dramatically reduce the false positive rate that plagues the current detection technology, get patients earlier access to life-saving interventions, and give radiologists more time to spend with their patients.
This year, the Data Science Bowl will award $1 million in prizes to those who observe the right patterns, ask the right questions, and in turn, create unprecedented impact around cancer screening care and prevention. The funds for the prize purse will be provided by the Laura and John Arnold Foundation.
8. NBA statistics data
This dataset contains 2004-2005 NBA and ABA stats for:
- Player regular season stats
- Player regular season career totals
- Player playoff stats
- Player playoff career totals
- Player all-star game stats
- Team regular season stats
- Complete draft history
- coaches_season.txt: NBA coaching records by season
- coaches_career.txt: NBA career coaching records
Currently all of the regular season.
Project idea:
Project H1: outlier detection on the players; find out who the outstanding players are.
Project H2: predict the game outcome.
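For Project H1, a robust starting point is the modified z-score, which replaces the mean and standard deviation with the median and the median absolute deviation (MAD) so that the outliers themselves do not distort the scale. The sketch below uses the conventional 0.6745 constant and 3.5 cutoff; the player names and per-game values are invented, and a single statistic is used only to keep the example short.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return {name: modified z-score} for entries beyond the threshold.

    values: dict mapping a player name to one numeric statistic.
    """
    med = median(values.values())
    mad = median(abs(v - med) for v in values.values())
    if mad == 0:
        return {}  # no spread to measure against
    scores = {name: 0.6745 * (v - med) / mad for name, v in values.items()}
    return {name: s for name, s in scores.items() if abs(s) > threshold}

# Invented points-per-game figures for hypothetical players.
points_per_game = {
    "player_a": 8.0, "player_b": 9.5, "player_c": 10.0,
    "player_d": 11.0, "player_e": 12.5, "player_f": 30.0,
}

outliers = mad_outliers(points_per_game)
```

Extending this to several statistics at once (points, rebounds, assists) would call for a multivariate method, which is where the project becomes more interesting.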