Recent advances in domain-driven data mining

Published: 27 December 2022
Volume 15, pages 1–7 (2023)

Chuanren Liu, Ehsan Fakharizadi, Tong Xu & Philip S. Yu

Data mining research has been significantly motivated by and benefited from real-world applications in novel domains. This special issue was proposed and edited to draw attention to domain-driven data mining and disseminate research in foundations, frameworks, and applications for data-driven and actionable knowledge discovery. Along with this special issue, we also organized a related workshop to continue the previous efforts on promoting advances in domain-driven data mining. This editorial report will first summarize the selected papers in the special issue, then discuss various industrial trends in the context of the selected papers, and finally document the keynote talks presented at the workshop. Although many scholars have made prominent contributions to the theme of domain-driven data mining, there are still various new research problems and challenges calling for more research investigations in the future. We hope this special issue is helpful for scholars working along this critically important line of research.

1 Summary of research contributions

Data mining has been a trending research area with contributions from diverse communities including computer scientists, statisticians, mathematicians, as well as other researchers and engineers working on data-intensive problems. While many researchers focus on general data mining methodologies for standardized problem settings, such as unsupervised learning and supervised learning, applying general solutions to specific problems may still be a nontrivial challenge. This is mainly due to the need to incorporate domain knowledge in implementing data mining solutions for novel real-world applications. Oftentimes standardized solutions must be significantly revised to accommodate unique characteristics of input data and deliver actionable results in novel application domains. Essentially, data mining research is highly applied. Many classic research problems are motivated by real-world applications and results of data mining research are expected to provide practical implications to business managers, government agencies, and all members of our society.

1.1 Overview of domain-driven data mining

Domain-driven data mining aims to bridge the gaps between theoretical research and practical applications in data mining and transform data intelligence to business value and impact [ 11 , 12 ]. Domain-driven data mining has been proposed as a research framework for discovering actionable knowledge and intelligence in a complex environment to directly transform data to decisions or enable decision-making actions [ 3 , 16 ].

Domain-driven data mining handles ubiquitous X-complexities and X-intelligences surrounding domain-driven actionable intelligence discovery. Examples of X-complexities and X-intelligences are related to domain complexity and intelligence, data complexity and intelligence, behavior complexity and intelligence, network complexity and intelligence, social complexity and intelligence, organizational complexity and intelligence, human complexity and intelligence, and their integration and meta-synthesis [ 8 , 16 ]. Analyzing and learning X-complexities and X-intelligences result in X-analytics [ 8 ] in various domains and on specific purposes. Examples are business analytics, behavior analytics, social analytics, operational analytics, risk analytics, customer analytics, insurance analytics, learning analytics, cybersecurity analytics, and financial analytics [ 15 , 21 , 24 , 26 , 28 , 29 , 31 , 38 , 40 , 41 , 42 , 43 , 51 ]. One prominent example of learning data complexities for in-depth data intelligence is the research on non-IID learning, which learns interactions and couplings (including correlation and dependency) involved in heterogeneous data, behaviors, and systems. Non-IID learning is applicable to many real-world applications such as non-IID outlier detection, non-IID recommendation, non-IID multimedia and multimodal analytics, and non-IID federated learning [ 5 , 6 , 17 ].

Domain-driven data mining also handles typical research issues and gaps in the existing body of knowledge for domain-driven and actionable intelligence delivery. The research on domain-driven actionable intelligence discovery includes but is not limited to: quantifying knowledge actionability (rather than just interestingness) of data mining results [ 14 ], domain knowledge representation and domain generalization [ 30 ], domain-driven actionable knowledge discovery process [ 3 , 16 ], context-aware analytics and learning [ 46 ], discovering actionable patterns by combined mining [ 4 , 54 ] and high-utility mining [ 27 ], pattern relation analysis [ 4 ], cross-domain and transfer learning [ 24 , 36 , 45 , 51 ], data-to-decision transformation [ 8 ], personalized learning and recommendation [ 49 ], next-best action learning and recommendation [ 13 , 23 ], reflective learning with explicit and implicit feedback [ 32 , 50 ], explainable and interpretable analytics and learning [ 18 ], unbiased and fair analytics and learning [ 1 , 25 , 32 ], privacy and security-preserved analytics [ 52 ], and ethical analytics [ 34 ].

To better understand the challenges, recent advances, and new opportunities in domain-driven data mining, this special issue, along with other related activities, was proposed to call for the latest theoretical and practical developments, expert opinions on the open challenges, lessons learned, and best practices in domain-driven data mining. The special issue received submissions from researchers with different backgrounds, but all focusing on data-intensive research topics with novel applications. The papers accepted in this special issue explored novel factors and challenges such as socioeconomic, organizational, human-centered, and cultural aspects in different data mining tasks. In the following, we first provide a summary of the selected papers in the special issue.

1.2 Applied and flexible deep learning

Deep representation learning has attracted much attention in recent years. For chronic disease diagnosis, Zhang et al. [ 48 ] designed an unsupervised representation learning method to obtain informative correlation-aware signals from multivariate time series data. The key idea was a contrastive learning framework with a graph neural network (GNN) encoder to capture inter- and intra-correlation of multiple longitudinal variables. The work also considered uncertainty quantification with evidential theory to assist the decision-making process in detecting chronic diseases. Also based on deep learning models, Sun et al. [ 37 ] adopted sequential long short-term memory (LSTM) models in the domain of sports analytics for the baseball industry. With the number of home runs as the predictive target, the authors applied their models to data from Major League Baseball (MLB) to support important decisions in managing players and teams. The results showed that deep learning models could perform better and bring valuable information to meet users’ needs. Focusing on more fundamental deep learning techniques, Zhao et al. [ 53 ] developed a flexible approach to compact architecture search for deep multitask learning (MTL) problems. Though sharing model architectures is a popular method for MTL problems, identifying the appropriate components to be shared by multiple tasks is still a challenge. Based on the expressive reinforcement learning framework, this paper proposed to discover flexible and compact MTL architectures with efficient search space and cost.

1.3 Interpretable and actionable predictions

A critical challenge facing data mining research is to discover actionable knowledge that can directly support decision-making tasks. In the domain of agricultural business and ecosystem management, Basak et al. [ 2 ] applied machine learning methods for a novel problem of soil moisture forecasting. The two modeling challenges were accurate long-term prediction and interpretable hydrological parameters. The proposed domain-driven solution was rooted in deterministic and physically based hydrological redistribution processes of gravity and suction.

As another example of actionable knowledge discovery, Dey et al. [ 19 ] proposed a systematic approach for fire station location planning. As urban fires could adversely affect the socioeconomic growth and ecosystem health of our communities, the authors applied various data mining and machine learning models in working with the Victoria Fire Department to make important decisions for selecting the location of a new fire station. The key idea in their approach was to develop effective models for demand prediction and utilize the models to define a generalized index to measure the quality of fire service in urban settings. The paper integrated multiple data sources and important domain knowledge/requirements in the modeling process. The final decision task was formulated as an integer programming problem to select the optimal location with maximum service coverage.
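The final selection step can be illustrated with a toy maximum-coverage formulation. The following is a minimal sketch, not the model of Dey et al.: the zone names, predicted demand values, coverage sets, and function names are all hypothetical, and the sketch enumerates candidate sites exhaustively instead of calling an integer programming solver, which is only practical for a handful of candidates.

```python
from itertools import combinations

# Hypothetical predicted demand (e.g., incidents per year) for each zone.
predicted_demand = {"zone_a": 120, "zone_b": 80, "zone_c": 200, "zone_d": 50}

# Hypothetical coverage sets: zones reachable within the response-time
# target from each candidate site.
coverage = {
    "site_1": {"zone_a", "zone_b"},
    "site_2": {"zone_b", "zone_c"},
    "site_3": {"zone_c", "zone_d"},
}

# Zones assumed to be served already by existing stations.
already_covered = {"zone_a"}


def covered_demand(sites):
    """Total predicted demand covered by existing stations plus chosen sites."""
    zones = set(already_covered)
    for site in sites:
        zones |= coverage[site]
    return sum(predicted_demand[zone] for zone in zones)


def best_sites(num_new_stations=1):
    """Exhaustively pick the site subset that maximizes covered demand."""
    return max(
        combinations(sorted(coverage), num_new_stations),
        key=covered_demand,
    )


chosen = best_sites(1)
```

With a realistic number of candidate sites, the same objective and constraints would normally be handed to an integer programming solver, since the number of subsets grows combinatorially.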

For sequential e-commerce product recommendation, Nasir and Ezeife [ 33 ] proposed the Semantic Enabled Markov Model Recommendation system to address long-standing challenges such as model complexity, data sparsity, and ambiguous predictions. Their system was proposed to extract and integrate sequential and semantic knowledge as well as contextual features. The new system showed improved recommendation performance for multiple e-commerce recommendation tasks.
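To make the Markov-model baseline concrete, here is a minimal sketch of a plain first-order Markov recommender over purchase sequences. It is not the authors' system, which additionally integrates semantic and contextual knowledge; the session data and function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical purchase sessions; each list is one customer's ordered sequence.
sessions = [
    ["phone", "case", "charger"],
    ["phone", "case", "screen_protector"],
    ["laptop", "mouse"],
    ["phone", "charger"],
]


def fit_transitions(sequences):
    """Count item-to-next-item transitions across all sequences."""
    transitions = defaultdict(Counter)
    for seq in sequences:
        for current_item, next_item in zip(seq, seq[1:]):
            transitions[current_item][next_item] += 1
    return transitions


def recommend(transitions, last_item, k=2):
    """Return up to k (item, probability) pairs ranked by transition probability."""
    counts = transitions.get(last_item)
    if not counts:
        # Unseen item: a plain Markov model has nothing to offer here,
        # which is the data sparsity issue mentioned in the text.
        return []
    total = sum(counts.values())
    # Sort by descending count, breaking ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(item, count / total) for item, count in ranked[:k]]


model = fit_transitions(sessions)
after_phone = recommend(model, "phone")
```

The empty result for unseen items and the tie between equally frequent successors show where extra semantic or contextual signals could help disambiguate predictions.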

1.4 Unsupervised learning with domain knowledge

Incorporating domain knowledge for unsupervised learning is particularly challenging due to the lack of a clearly defined learning target. In the domain of health care, Jasinska-Piadlo et al. [ 22 ] explored the advantages and the challenges of a “domain-led” approach versus a data-driven approach to K-means clustering analysis. The authors compared expert opinions and principal component analysis for selecting the most useful variables to be used for the K-means clustering. The paper discussed comparative advantages of each approach and illustrated that domain knowledge played an important role at the interpretation stage of the clustering results. The authors developed a practical checklist guiding how to enable the integration of domain knowledge into a data mining project.

Similarly, text mining and natural language processing are important research tools in many areas. However, many state-of-the-art text and language models are developed for general contexts, and careful adaptation is often needed in applying such techniques to domain-specific data. In this special issue, Villanes and Healey [ 39 ] investigated the use of sentiment dictionaries to estimate sentiment for large document collections. The authors presented a semiautomatic method for extending a general sentiment dictionary for a specific target domain. To minimize manual effort, the authors combined statistical term identification and term evaluation using Amazon Mechanical Turk in a study on dengue fever. The same approach could potentially be applied for constructing similar term-based sentiment dictionaries in other target domains.
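The effect of extending a general dictionary with domain terms can be sketched as follows. This is an illustrative toy, not the method of Villanes and Healey: the terms, valence values, and function names are hypothetical, and the hard-coded domain ratings merely stand in for ratings that would be collected from human evaluators.

```python
import re

# Hypothetical general-purpose dictionary: term -> valence in [-1, 1].
general_dictionary = {"good": 0.7, "bad": -0.7, "severe": -0.6, "relief": 0.6}

# Hypothetical domain terms with ratings (stand-ins for crowd-sourced ratings).
domain_terms = {"outbreak": -0.8, "fever": -0.5, "vaccine": 0.5}


def extend_dictionary(general, domain):
    """Merge dictionaries; domain ratings override general ones on conflict."""
    merged = dict(general)
    merged.update(domain)
    return merged


def score_document(text, dictionary):
    """Average valence of dictionary terms in the text; None if none match."""
    tokens = re.findall(r"[a-z']+", text.lower())
    values = [dictionary[token] for token in tokens if token in dictionary]
    return sum(values) / len(values) if values else None


doc = "Dengue outbreak brings fever to the region"
general_score = score_document(doc, general_dictionary)
domain_score = score_document(
    doc, extend_dictionary(general_dictionary, domain_terms)
)
```

In this toy example the general dictionary matches no term in the document, so no sentiment estimate is available until the domain terms are added.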

2 New trends from the industry perspective

A continuing trend in the data mining field has been the proliferation of its applications to new domains. This is partly due to the advancements in machine learning technologies evidenced by and promoted through frequent reports of new performance records on benchmark tasks. Another contributor to this proliferation is the increase in the quantity of data collected, stored, and appropriately documented for mining since the benefits of leveraging this data have become more apparent. Some of the works in this special issue demonstrated how data mining techniques can be applied in agriculture [ 2 ], health care and medicine [ 22 , 48 ], and city planning [ 19 ].

One aspect of data quality at the core of this expansion is the growing use of rich data formats. Image, audio, video, and raw text can now be almost directly fed into models that process them to extract meaningful features, patterns, and insights. These formats now often supplement the tabular data structures of the past as shown by Nasir and Ezeife [ 33 ]. To accommodate using these new formats, data mining and machine learning models have adapted to support multi-channel, multimodal, and sequential inputs [ 33 , 37 ].

As more domains employ novel data mining techniques, there have been more opportunities for cross-domain spillovers. We now see more examples of transfer learning, where models trained on one (source) domain are applied in another (target) domain suffering from data scarcity. However, learning generalized models that perform well on multiple tasks could be a challenging process [ 53 ]. These models are often trained with self-supervision on large data and contain millions or billions of learned parameters, such as models for language processing (e.g., BERT, GPT-3, XLNet) and image classification (ResNet, EfficientNet, Inception). A fundamental property of many generalized models is their ability to encode the input data into a vectorized representation, as evidenced by Zhang et al. [ 48 ].

Another recent challenge in data mining, one that is especially amplified in the case of transfer learning involving large models, is the issue of compactness. In many domains, where there is a need for scalable low-latency inferences and when the cost of training new models and deploying them could get high, it becomes necessary to restrict the model size. There are several techniques to accomplish these objectives including pruning, distilling, and training with constraints as Zhao et al. [ 53 ] demonstrated in this special issue.
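Of the techniques named above, pruning is the simplest to illustrate. The sketch below shows generic magnitude-based pruning on a flat list of weights; it is not the architecture search approach of Zhao et al., and the weight values and function name are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    num_to_prune = int(len(weights) * sparsity)
    # Indices sorted by ascending absolute value; the first ones are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_indices = set(order[:num_to_prune])
    return [0.0 if i in pruned_indices else w for i, w in enumerate(weights)]


# Hypothetical weights of one small layer.
layer_weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.1]
compact = magnitude_prune(layer_weights, sparsity=0.5)
```

In practice, pruned models are usually fine-tuned afterward to recover accuracy, and the zeros only reduce latency when the storage format or hardware exploits sparsity.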

Along with these trends, there have been several key developments in the structures used for data mining. One that has drastically improved the ability to digest sequential data is the invention of transformer structures. Transformers have effectively revolutionized the deep learning field by enabling models to understand the internal relationship between interdependent data points. These structures are the primary building blocks of some of the large generalized models mentioned above. Another recent advance is the improved ability of generative models that learn not to score or classify but to create rich outputs such as images, texts, or audio. We also continue seeing more expansion in the field of graph neural networks, where models learn and reproduce attributes of a graph data structure [ 48 ].

The sophistication of data mining methods has resulted in improved performance but comes at a cost. Models that use larger and richer input data, capture complex interactions between data points, and map the inputs to abstract representation spaces are very hard if not impossible to interpret. In many domains, it is important for the model outputs to be explainable to decision makers. Explainability matters for three reasons. First, explainable results are more powerful at both convincing decision makers and educating them with insights from the data [ 2 ]. Second, explainability is a safeguard against models learning human biases and learning to discriminate. Finally, in some applications, it is necessary to understand not just the predicted value, but also the uncertainty of the predictions. Uncertainty modeling and quantification may be necessary in order to know when to rely on the machine and when to rely on the human. A recently popularized concept in this area is the human-in-the-loop approach, where models continuously receive and learn input from human experts and human decision makers, and meanwhile, experts use model predictions in their decision making on a regular basis. Our authors in this special issue have demonstrated the great potential of domain-driven data mining in addressing the aforementioned challenges, and more work is needed in the future with the collaboration between academia and industry.
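The idea of relying on the machine only when its uncertainty is low can be sketched with a simple deferral rule. This is a minimal illustration under stated assumptions: uncertainty is approximated by the disagreement among the members of a hypothetical model ensemble, and the threshold, scores, and function name are invented for the example.

```python
from statistics import mean, pstdev


def predict_or_defer(ensemble_predictions, max_std=0.1):
    """Return (decision, mean, std); defer to a human when members disagree."""
    avg = mean(ensemble_predictions)
    spread = pstdev(ensemble_predictions)
    decision = "model" if spread <= max_std else "human"
    return decision, avg, spread


# Hypothetical risk scores from five ensemble members for two cases.
confident_case = [0.81, 0.79, 0.80, 0.82, 0.78]
uncertain_case = [0.20, 0.90, 0.55, 0.35, 0.75]

confident_result = predict_or_defer(confident_case)
uncertain_result = predict_or_defer(uncertain_case)
```

In a human-in-the-loop setting, the cases routed to experts could later be fed back as labeled examples, although how to do so is beyond this sketch.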

3 Domain-driven data mining workshop

To facilitate the exchange of recent advances in domain-driven data mining, the Domain-Driven Data Mining Workshop was organized as a part of the 2021 SIAM International Conference on Data Mining. The workshop invited three keynote speakers and received paper submissions from multiple institutions. The papers accepted by the workshop were later invited for potential publication in this special issue. In the following, we review the invited keynote talks at the Domain-Driven Data Mining Workshop.

3.1 Actionable intelligence discovery

We first invited Dr. Longbing Cao for his keynote talk, “Domain-Driven and Actionable Intelligence Discovery.” In 2004, Dr. Cao proposed the concept “domain-driven data mining” and has since led the implementation of many large enterprise data science projects for actionable knowledge discovery for governments and businesses, involving over 10 domains including capital markets, banking, insurance, telecommunication, transport, education, smart cities, online business, and public sectors (e.g., financial service, taxation, social welfare, IP, regulation, immigration).

Dr. Cao led a series of activities and proposed “domain-driven data mining” for “actionable knowledge discovery” in complex domains and problems, when discovering “actionable intelligence” was not a trivial task. The significant developments of data science, new-generation AI, and deep neural learning make domain-driven actionable intelligence discovery possible, with progress made in areas such as representing and learning various complexities and intelligences in complex systems, data, and behaviors. In his talk, Dr. Cao first reviewed the aims, progress, and gaps of conventional data mining/knowledge discovery and machine learning, domain-driven actionable knowledge discovery, and challenges and opportunities in domain-driven actionable intelligence discovery. Then, Dr. Cao discussed related strategic issues in data science thinking [ 8 ], new-generation AI [ 9 ], and actionable deep learning. Dr. Cao shared many thought-provoking illustrations, case studies, and theoretical and practical challenges in industry and government data sciences.

Particularly, Dr. Cao has made broad and in-depth contributions in understanding data complexities and data intelligence. One of his recent foci is learning from non-IID data, forming the research on non-IID learning [ 10 , 17 ]. Non-IID learning goes beyond the classic analytical and learning systems based on the common independent and identically distributed (IID) assumption widely taken in existing science, technology, and engineering. It studies the comprehensive non-IIDnesses [ 5 ], i.e., coupling relationships and interactions (including but beyond correlation and dependency) [ 6 ], and heterogeneities (including but beyond nonidentical distribution) in data, behaviors, and systems. The research on non-IID learning has extended to almost all areas in data mining, analytics, and learning [ 17 ], such as non-IID data preparation, non-IID feature engineering, non-IID representation learning, non-IID similarity and metric learning, non-IID statistical learning, non-IID learning architecture, non-IID ensemble learning, non-IID federated learning, non-IID transfer learning, non-IID evaluation and validation, and various non-IID learning applications, such as non-IID recommender systems, non-IID outlier detection, non-IID information retrieval, and non-IID image and vision learning [ 5 , 20 , 35 , 47 , 55 ].

For instance, Cao [ 7 ] emphasized the critical issues with the intrinsic assumption of IID users and items in existing recommender systems, which leads to false, misleading, or incorrect recommendations and poor performance in cold-start, sparse, and dynamic recommendations. Therefore, a non-IID theoretical framework is needed in order to build a deep and comprehensive understanding of the intrinsic nature of recommendation problems, from the perspective of both couplings and heterogeneities. Such research investigations led by Dr. Cao have triggered the paradigm shift from IID to non-IID recommendation research and can hopefully deliver informed, relevant, personalized, and actionable recommendations. Altogether, these contributions led to exciting new directions and fundamental solutions to address various challenges including cold-start, sparse data-based, cross-domain, group-based, and shilling attack-related issues in recommender systems.

3.2 A deep learning framework

We invited Dr. Balaji Padmanabhan for his keynote talk titled “Domain-Driven Data Mining: Examples and a Deep Learning Framework.” Dr. Padmanabhan is the Anderson Professor of Global Management and Professor of Information Systems at the University of South Florida’s Muma College of Business, where he is also the director of the Center for Analytics and Creativity. He has worked in data science, AI/machine learning, and business analytics for over two decades in the areas of research, teaching, business management, mentoring graduate students, and designing academic programs. He has also worked with over twenty firms on machine learning and data science initiatives in a variety of sectors. He has published extensively in data science and related areas at premier journals and conferences in the field and has served on the editorial board of leading journals including Management Science, MIS Quarterly, INFORMS Journal on Computing, Information Systems Research, Big Data, ACM Transactions on MIS, and the Journal of Business Analytics.

Dr. Padmanabhan witnessed and led the development of data mining. “I did my PhD at that time when the term of data mining first came up,” he shared with the workshop audience before reviewing the history of domain-driven data mining research. Then he presented a series of examples over the last two decades of his work. In generalizing from these examples, he emphasized that there are often different extents to which “domain” matters in different data mining endeavors. Dr. Padmanabhan encouraged the workshop audience to “think domain-driven,” which often motivates novel domain-driven methods that can meanwhile be applied more broadly (or “domain free”). Dr. Padmanabhan also shared a general framework for domain-driven deep learning in business research and used this framework to show how researchers can highlight significant contributions and position their own papers and ideas. Dr. Padmanabhan’s insightful cases and valuable research advice were greatly appreciated by the workshop audience from research communities of both computer science and management information systems.

In his talk, Dr. Padmanabhan also shared that his department has completed 100 projects in 7 years with about 30 companies, and funded postdoctoral research in analytics. His department has several outreach initiatives such as the Economic Analytics Initiative and the Florida Business Analytics Forum. Dr. Padmanabhan highlighted that such industrial collaborations and initiatives have greatly rewarded research activities, particularly in domain-driven data mining projects. Dr. Padmanabhan encouraged researchers to actively reach out to industry not only to find data but also to ask for new research questions.

3.3 Human resource management

We invited Dr. Hui Xiong for his keynote talk, “Artificial Intelligence in Human Resource Management.” Dr. Hui Xiong is a Distinguished Professor at Rutgers, the State University of New Jersey. He also served as the Smart City Chief Scientist and the Deputy Dean of Baidu Research Institute in charge of several research laboratories. He is a co-Editor-in-Chief of Encyclopedia of GIS, an Associate Editor of IEEE Transactions on Big Data (TBD), ACM Transactions on Knowledge Discovery from Data (TKDD), and ACM Transactions on Management Information Systems (TMIS). Dr. Xiong has chaired many international conferences in data mining, serving as a Program Co-Chair (2013) and a General Co-Chair (2015) for the IEEE International Conference on Data Mining (ICDM), and as a Program Co-Chair of the Research Track (2018) and the Industry Track (2012) for the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). Dr. Xiong’s research has generated substantive impact beyond academia. He is an ACM distinguished scientist and has been honored by the ICDM-2011 Best Research Paper Award, the 2017 IEEE ICDM Outstanding Service Award, and the 2018 Ram Charan Management Practice Award as the Grand Prix winner from the Harvard Business Review. In 2020, he was named as an AAAS Fellow and an IEEE Fellow.

Dr. Xiong shared a success story in leveraging big data technology for human resource management. Indeed, the availability of large-scale human resource (HR) data has enabled unparalleled opportunities for business leaders to understand talent behaviors and generate useful talent knowledge, which in turn delivers intelligence for real-time decision making and effective people management at work. In his talk, Dr. Xiong introduced a powerful set of innovative Artificial Intelligence (AI) techniques developed for intelligent human resource management, such as recruiting, performance evaluation, talent retention, talent development, job matching, team management, leadership development, and organization culture analysis. With his rich experience and close collaborations with the industry, Dr. Xiong demonstrated how the results of talent analytics can be used for other business applications, such as market trend analysis and financial investment.

4 Concluding remarks

This special issue was proposed and edited to draw attention to domain-driven data mining and disseminate research in foundations, frameworks, and applications for data-driven and actionable knowledge discovery. This special issue and related activities on recent advances in domain-driven data mining continued the previous efforts including the workshop series on the same topic during 2007–2014 with the IEEE International Conference on Data Mining and a special issue published by the IEEE Transactions on Knowledge and Data Engineering [ 44 ]. Although many scholars have made significant contributions with the theme of domain-driven data mining, there are still various new research problems and challenges calling for more research investigations in the coming years. We hope this special issue is helpful for scholars working along this critically important line of research.

Alves, G., Amblard, M., Bernier, F., Couceiro, M., Napoli, A.: Reducing unintended bias of ML models on tabular and textual data. In: DSAA, pp. 1–10 (2021)

Basak, A., Schmidt, K.M., Mengshoel, O.J.: From data to interpretable models: machine learning for soil moisture forecasting. Int. J. Data Sci. Anal. (2022). https://doi.org/10.1007/s41060-022-00347-8

Cao, L.: Domain-driven data mining: challenges and prospects. IEEE Trans. Knowl. Data Eng. 22 (6), 755–769 (2010)

Article Google Scholar

Cao, L.: Combined mining: analyzing object and pattern relations for discovering and constructing complex yet actionable patterns. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 3 (2), 140–155 (2013)

Cao, L.: Non-iidness learning in behavioral and social data. Comput. J. 57 (9), 1358–1370 (2014)

Cao, L.: Coupling learning of complex interactions. Inf. Process. Manag. 51 (2), 167–186 (2015)

Cao, L.: Non-iid recommender systems: a review and framework of recommendation paradigm shifting. Engineering 2 (2), 212–224 (2016)

Cao, L.: Data Science Thinking: The Next Scientific, Technological and Economic Revolution. Data Analytics. Springer, Berlin (2018)

Book Google Scholar

Cao, L.: A new age of AI: features and futures. IEEE Intell. Syst. 37 (1), 25–37 (2022)

Cao, L.: Beyond i.i.d.: non-iid thinking, informatics, and learning. IEEE Intell. Syst. 37 (04), 5–17 (2022)

Cao, L., Zhang, C.: Domain-driven actionable knowledge discovery in the real world. In: PAKDD 2006, pp. 821–830 (2006)

Cao, L., Zhang, C.: The evolution of kdd: towards domain-driven data mining. IJPRAI 21 (4), 677–692 (2007)

Google Scholar

Cao, L., Zhu, C.: Personalized next-best action recommendation with multi-party interaction learning for automated decision-making. PLoS ONE 17 , 1–22 (2022)

Cao, L., Luo, D., Zhang, C.: Knowledge actionability: satisfying technical and business interestingness. IJBIDM 2 (4), 496–514 (2007)

Cao, L., Zhang, C., Yang, Q., Bell, D.A., Vlachos, M., Taneri, B., Keogh, E.J., Yu, P.S., Zhong, N., Ashrafi, M.Z., Taniar, D., Dubossarsky, E., Graco, W.: Domain-driven, actionable knowledge discovery. IEEE Intell. Syst. 22 (4), 78–88 (2007)

Cao, L., Yu, P.S., Zhang, C., Zhao, Y.: Domain Driven Data Mining. Springer, Berlin (2010)

Book MATH Google Scholar

Cao, L., Philip, S.Y., Zhao, Z.: Shallow and deep non-iid learning on complex data. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (2022)

Carlevaro, A., Mongelli, M.: A new SVDD approach to reliable and explainable AI. IEEE Intell. Syst. 37 (2), 55–68 (2022)

Dey, A., Heger, A., England, D.: Urban fire station location planning using predicted demand and service quality index. Int. J. Data Sci. Anal. (2022). https://doi.org/10.1007/s41060-022-00328-x

Do, T.D.T., Cao, L.: Gamma-Poisson dynamic matrix factorization embedded with metadata influence. In: NeurIPS 2018, pp. 5829–5840 (2018)

He, F., Li, Y., Xu, T., Yin, L., Zhang, W., Zhang, X.: A data-analytics approach for risk evaluation in peer-to-peer lending platforms. IEEE Intell. Syst. 35 (3), 85–95 (2020)

Jasinska-Piadlo, A., Bond, R., Biglarbeigi, P., Brisk, R., Campbell, P., Browne, F., McEneaneny, D.: Data-driven versus a domain-led approach to k-means clustering on an open heart failure dataset. Int. J. Data Sci. Anal. (2022). https://doi.org/10.1007/s41060-022-00346-9

Jin, B., Yang, H., Sun, L., Liu, C., Qu, Y., Tong, J.: A treatment engine by predicting next-period prescriptions. In: KDD, pp. 1608–1616 (2018)

Kanter, J.M., Gillespie, O., Veeramachaneni, K.: Label, segment, featurize: a cross domain framework for prediction engineering. In: DSAA, pp. 430–439 (2016)

Ke, W., Liu, C., Shi, X., Dai, Y., Yu, P.S., Zhu, X.: Addressing exposure bias in uplift modeling for large-scale online advertising. In: ICDM, pp. 1156–1161 (2021)

Kompan, M., Gaspar, P., Macina, J., Cimerman, M., Bieliková, M.: Exploring customer price preference and product profit role in recommender systems. IEEE Intell. Syst. 37 (1), 89–98 (2022)

Lin, J.C.-W., Gan, W., Fournier-Viger, P., Hong, T.-P., Tseng, V.S.: Mining high-utility itemsets with various discount strategies. In: DSAA, pp. 1–10 (2015)

Liu, C., Zhu, W.: Precision coupon targeting with dynamic customer triage. In: DSAA, pp. 420–428 (2020)

Liu, Q., Zeng, X., Liu, C., Zhu, H., Chen, E., Xiong, H., Xie, X.: Mining indecisiveness in customer behaviors. In: ICDM, pp. 281–290 (2015)

Long, M., Wang, J., Sun, J.-G., Yu, P.S.: Domain invariant transfer kernel learning. IEEE Trans. Knowl. Data Eng. 27 (6), 1519–1532 (2015)

Ma, D., Narayanan, V.K., Liu, C., Fakharizadi, E.: Boundary salience: the interactive effect of organizational status distance and geographical proximity on coauthorship tie formation. Soc. Netw. 63 , 162–173 (2020)

Melucci, M.: Investigating sample selection bias in the relevance feedback algorithm of the vector space model for information retrieval. In: DSAA, pp. 83–89 (2014)

Nasir, M., Ezeife, C.I.: Semantic enhanced Markov model for sequential e-commerce product recommendation. Int. J. Data Sci. Anal., (2022) https://doi.org/10.1007/s41060-022-00343-y

O’Leary, D.E.: Ethics for big data and analytics. IEEE Intell. Syst. 31 (4), 81–84 (2016)

Pang, G., Cao, L., Chen, L.: Homophily outlier detection in non-iid categorical data. Data Min. Knowl. Discov. 35 (4), 1163–1224 (2021)

Ruiz-Dolz, R., Alemany, J., Barberá, S.H., García-Fornes, A.: Transformer-based models for automatic identification of argument relations: a cross-domain evaluation. IEEE Intell. Syst. 36 (6), 62–70 (2021)

Sun, H.-C., Lin, T.-Y., Tsai, Y.-L.: Performance prediction in major league baseball by long short-term memory networks. Int. J. Data Sci. Anal. (2022). https://doi.org/10.1007/s41060-022-00313-4

Teng, M., Zhu, H., Liu, C., Xiong, H.: Exploiting network fusion for organizational turnover prediction. ACM Trans. Manag. Inf. Syst. 12 (2), 16:1–16:18 (2021)

Villanes, A., Healey, C.G.: Domain-specific text dictionaries for text analytics. Int. J. Data Sci. Anal., Special Issue on Domain-Driven Data Mining (2022)

Xiang, H., Lin, J., Chen, C.-H., Kong, Y.: Asymptotic meta learning for cross validation of models for financial data. IEEE Intell. Syst. 35 (2), 16–24 (2020)

Xu, L., Wei, X., Cao, J., Yu, P.S.: Multiple social role embedding. In: DSAA, pp. 581–589. IEEE (2017)

Yang, D., Qu, B., Cudré-Mauroux, P.: Location-centric social media analytics: challenges and opportunities for smart cities. IEEE Intell. Syst. 36 (5), 3–10 (2021)

Yang, J., Liu, C., Teng, M., Xiong, H., Liao, M., Zhu, V.: Exploiting temporal and social factors for B2B marketing campaign recommendations. In: ICDM, pp. 499–508 (2015)

Zhang, C., Yu, P., Bell, D.: Introduction to the domain-driven data mining special section. IEEE Trans. Knowl. Data Eng. 22 (6), 753–754 (2010)

Zhang, J., He, M.: CRTL: context restoration transfer learning for cross-domain recommendations. IEEE Intell. Syst. 36 (4), 65–72 (2021)

Zhang, K., Chen, E., Liu, Q., Liu, C., Lv, G.: A context-enriched neural network method for recognizing lexical entailment. In: AAAI, pp. 3127–3134 (2017)

Zhang, Q., Cao, L., Zhu, C., Li, Z., Sun, J.: CoupledCF: learning explicit and implicit user-item couplings in recommendation for deep collaborative filtering. In: IJCAI 2018, pp. 3662–3668 (2018)

Zhang, X., Wang, Y., Zhang, L., Jin, B., Zhang, H.: Exploring unsupervised multivariate time series representation learning for chronic disease diagnosis. Int. J. Data Sci. Anal. (2022). https://doi.org/10.1007/s41060-021-00290-0

Zhang, Y., Liu, G., Liu, A., Zhang, Y., Li, Z., Zhang, X., Li, Q.: Personalized geographical influence modeling for POI recommendation. IEEE Intell. Syst. 35 (5), 18–27 (2020)

Zhang, Y., Bai, G., Zhong, M., Li, X., Ryan, K.L.K.: Differentially private collaborative coupling learning for recommender systems. IEEE Intell. Syst. 36 (1), 16–24 (2021)

Zhang, Y., Zhang, X., Shen, T., Zhou, Y., Wang, Z.: Feature-option-action: a domain adaption transfer reinforcement learning framework. In: DSAA, pp. 1–12 (2021)

Zhang, Z., Liu, Q., Huang, Z., Wang, H., Lu, C., Liu, C., Chen, E.: GraphMI: extracting private graph data from graph neural networks. In: IJCAI, pp. 3749–3755 (2021)

Zhao, J., Lv, W., Du, B., Ye, J., Sun, L., Xiong, G.: Deep multi-task learning with flexible and compact architecture search. Int. J. Data Sci. Anal., Special Issue on Domain-Driven Data Mining (2022)

Zhao, Y., Zhang, H., Cao, L., Zhang, C., Bohlscheid, H.: Combined pattern mining: from learned rules to actionable knowledge. In: AI 2008, pp. 393–403 (2008)

Zhu, C., Cao, L., Yin, J.: Unsupervised heterogeneous coupling learning for categorical representation. IEEE Trans. Pattern Anal. Mach. Intell. 44 (1), 533–549 (2022)

Author information

Authors and affiliations.

The University of Tennessee, Knoxville, USA

Chuanren Liu

Snap Inc., Seattle, WA, USA

Ehsan Fakharizadi

University of Science and Technology of China, Hefei, China

Tong Xu

University of Illinois Chicago, Chicago, USA

Philip S. Yu

Corresponding author

Correspondence to Chuanren Liu.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Liu, C., Fakharizadi, E., Xu, T. et al. Recent advances in domain-driven data mining. Int J Data Sci Anal 15, 1–7 (2023). https://doi.org/10.1007/s41060-022-00378-1

Published: 27 December 2022

Issue Date: January 2023

DOI: https://doi.org/10.1007/s41060-022-00378-1

data mining techniques Recently Published Documents

Prediction of Skin Diseases Using Machine Learning

Skin disease rates have been increasing over the past few decades, causing both fatal and non-fatal disabilities worldwide, especially in areas where medical resources are scarce. Early diagnosis significantly increases the chances of cure. This work therefore compares six machine learning algorithms (KNN, random forest, neural network, naïve Bayes, logistic regression, and SVM) for predicting skin diseases. Information gain, gain ratio, Gini decrease, chi-square, and ReliefF are used to rank the features. The paper comprises an introduction, a literature review, and the proposed methodology. It proposes a new method in which the six data mining techniques are integrated into a single ensemble. On the dermatology dataset, the ensemble achieves 94% accuracy, an improvement over the other classifiers, and hence is more effective in this area.
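
Of the feature-ranking criteria named in this abstract, information gain is the simplest to illustrate. The following is a minimal sketch, not the authors' implementation: it assumes categorical features, and the toy records, feature names, and diagnosis labels are hypothetical and do not come from the dermatology dataset.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their expected entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(rows, labels, feature_names):
    """Return (name, gain) pairs sorted from most to least informative."""
    gains = [
        (name, information_gain([row[i] for row in rows], labels))
        for i, name in enumerate(feature_names)
    ]
    return sorted(gains, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy records: (itching, scaling) -> diagnosis
rows = [("yes", "high"), ("yes", "low"), ("no", "high"), ("no", "low")]
labels = ["psoriasis", "psoriasis", "eczema", "eczema"]
ranking = rank_features(rows, labels, ["itching", "scaling"])
```

In this toy data the hypothetical "itching" feature separates the two classes perfectly and so ranks first, while "scaling" carries no information about the label.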

A Survey on Building Recommendation Systems Using Data Mining Techniques

Classification is a data mining technique used to estimate the group membership of items on the basis of a common feature. This technique is valuable for future planning and for discovering new knowledge about a specific dataset. An in-depth study of prior literature applying data mining techniques to the design of recommender systems was performed. This chapter surveys how recommender systems are designed using various data mining classification techniques from machine learning, and examines their methodological decisions in four aspects: recommendation approaches, data mining techniques, recommendation types, and performance measures. The study focuses on selected classification methods and can support both researchers and students in computer science and machine learning in strengthening their knowledge of machine learning and data mining.

A Classification and Clustering Approach Using Data Mining Techniques in Analysing Gastrointestinal Tract

Diagnosis and Detection of Plant Diseases Using Data Mining Techniques

Location-Based Crime Prediction Using Multiclass Classification Data Mining Techniques

An Effective Approach to Test Suite Reduction and Fault Detection Using Data Mining Techniques

Software testing is used to find bugs so that a quality product reaches end users. Test suites are used to detect failures in software, but they may be redundant and take a long time to execute. In this article, a large number of test cases are created using combinatorial test design algorithms. Attribute reduction is an important preprocessing task in data mining: weak and irrelevant attributes are removed to reduce complexity. After preprocessing, it is not necessary to test the software with every combination of test cases, since the test cases are numerous and redundant; instead, the more effective test cases are identified using a data mining algorithm. The resulting final test suite identifies the defects in the software, provides better coverage analysis, and reduces execution time.
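
The abstract does not say which algorithm selects the retained test cases. As a rough illustration of test suite reduction in general, the sketch below swaps in a plain greedy set-cover heuristic that keeps pairwise (2-way) coverage. This is an assumption, not the article's method, and the parameter values are hypothetical.

```python
from itertools import combinations, product


def pairs_covered(test):
    """Every pair of (parameter index, value) settings exercised by one test case."""
    return {((i, test[i]), (j, test[j])) for i, j in combinations(range(len(test)), 2)}


def reduce_suite(tests):
    """Greedy set cover: repeatedly keep the test that covers the most uncovered pairs."""
    uncovered = set().union(*(pairs_covered(t) for t in tests))
    selected = []
    while uncovered:
        best = max(tests, key=lambda t: len(pairs_covered(t) & uncovered))
        selected.append(best)
        uncovered -= pairs_covered(best)
    return selected


# Hypothetical configuration space: 3 parameters with 2 values each (8 exhaustive tests).
full_suite = list(product(["chrome", "firefox"], ["linux", "windows"], ["ipv4", "ipv6"]))
reduced = reduce_suite(full_suite)
```

On this toy space the greedy pass keeps 4 of the 8 exhaustive test cases while still exercising every pair of parameter values. A greedy cover is not guaranteed to be minimal in general.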

Dengue Fever Prediction Modelling using Data Mining Techniques

Applying Data Mining Techniques to Classify Patients With Suspected Hepatitis C Virus Infection

Fake News Detection Using Data Mining Techniques

Nowadays, the internet is well known as an information source where information may be real or fake. Fake news has existed on the web for several years. The main challenge is to determine the truthfulness of the news. The motive behind writing and publishing fake news is to mislead people, and it can damage an agency, entity, or person. This paper aims to detect fake news using semantic search.

A Leading Indicator Approach with Data Mining Techniques in Analysing Bitcoin Market Value

Data mining articles within Nature Methods

Brief Communication | 04 July 2024 | Open Access

quantms: a cloud-based pipeline for quantitative proteomics enables the reanalysis of public proteomics data

Scalable tools are needed for the analysis of increasingly large mass spectrometry-based proteomics datasets. quantms offers an open-source, cloud-based pipeline for massively parallel proteomics data analysis.

Chengxin Dai, Julianus Pfeuffer & Yasset Perez-Riverol

Article | 01 July 2024 | Open Access

Predicting glycan structure from tandem mass spectrometry via deep learning

CandyCrunch is a deep learning-based tool for predicting glycan structures from tandem mass spectrometry data. The paper also introduces CandyCrumbs that automatically annotates fragment ions in higher-order tandem mass spectrometry spectra.

James Urban, Chunsheng Jin & Daniel Bojar

Article | 16 May 2024

Indexing and searching petabase-scale nucleotide resources

The Pebblescout tool achieves an efficient search for subjects in a large nucleotide database such as runs in Sequence Read Archive data.

Sergey A. Shiryev & Richa Agarwala

Brief Communication | 20 March 2024 | Open Access

SQANTI3: curation of long-read transcriptomes for accurate identification of known and novel isoforms

SQANTI3 offers a flexible tool for quality control, curation and annotation of long-read RNA sequencing data.

Francisco J. Pardo-Palacios, Angeles Arzalluz-Luque & Ana Conesa

Research Highlight | 11 January 2024

Chroma is a generative model for protein design

Arunima Singh

Article | 08 January 2024 | Open Access

A fast, scalable and versatile tool for analysis of single-cell omics data

SnapATAC2 uses a matrix-free spectral embedding algorithm for nonlinear dimension reduction of single-cell omics data, which shows an improved performance in capturing cellular heterogeneity and scalability for large datasets.

, Nathan R. Zemke & Bing Ren

Research Briefing | 04 December 2023

SEVtras characterizes cell-type-specific small extracellular vesicle secretion

Although single-cell RNA-sequencing has revolutionized biomedical research, exploring cell states from an extracellular vesicle viewpoint has remained elusive. We present an algorithm, SEVtras, that accurately captures signals from small extracellular vesicles and determines source cell-type secretion activity. SEVtras unlocks an extracellular dimension for single-cell analysis with diagnostic potential.

Article | 04 December 2023 | Open Access

SEVtras delineates small extracellular vesicles at droplet resolution from single-cell transcriptomes

SEVtras is an algorithm that uses single-cell RNA sequencing data to assess small extracellular vesicle activity at droplet resolution.

, Junjie Zhu & Fangqing Zhao

Research Highlight | 11 July 2023

Speedier protein structure search

Article | 23 January 2023 | Open Access

Convolutional networks for supervised mining of molecular patterns within cellular context

DeePiCt (deep picker in context) is a versatile, open-source deep-learning framework for supervised segmentation and localization of subcellular organelles and biomolecular complexes in cryo-electron tomography.

Irene de Teresa-Trueba, Sara K. Goetz & Judith B. Zaugg

Resource | 07 November 2022

High-dimensional gene expression and morphology profiles of cells across 28,000 genetic and chemical perturbations

This Resource presents and analyzes four datasets containing both gene expression and morphological profile data for cells subjected to hundreds to thousands of chemical or genetic perturbations and highlights their complementary nature.

Marzieh Haghighi, Juan C. Caicedo & Shantanu Singh

Article | 27 October 2022

Netie: inferring the evolution of neoantigen–T cell interactions in tumors

Netie, a hierarchical Bayesian model to infer the neoantigen evolution and immune selection pressure during tumor progression.

, Seongoh Park & Tao Wang

Article | 25 July 2022 | Open Access

Self-supervised deep learning encodes high-resolution features of protein subcellular localization

Cytoself is a self-supervised deep learning-based approach for profiling and clustering protein localization from fluorescence images. Cytoself outperforms established approaches and can accurately predict protein subcellular localization.

Hirofumi Kobayashi, Keith C. Cheveralls & Loic A. Royer

Research Highlight | 11 March 2022

Viral discovery at a global scale

Ultra-high-throughput sequence alignment enables identification of >130,000 novel RNA viruses.

Correspondence | 25 June 2021

gEAR: Gene Expression Analysis Resource portal for community-driven, multi-omic data exploration

Joshua Orvis, Brian Gottfried & Ronna Hertzano

Article | 01 March 2021

Fast searches of large collections of single-cell data using scfind

Advances in single-cell sequencing technologies enable generation of datasets of millions of cells. scfind facilitates efficient and sophisticated gene search in massive single-cell datasets.

Jimmy Tsz Hang Lee, Nikolaos Patikas & Martin Hemberg

Method to Watch | 06 January 2021

Diving into the TCR repertoire

Computational approaches help us explore complexities of the T cell receptor repertoire.

Madhura Mukhopadhyay

News & Views | 26 October 2020

Getting more for less: new software solutions for glycoproteomics

Algorithms from the Nesvizhskii and Smith groups facilitate faster and more comprehensive glycopeptide assignments from mass spectrometry datasets.

Jeremy L. Praissman & Lance Wells

Brief Communication | 27 June 2019

Pathway-level information extractor (PLIER) for gene expression data

Pathway-level information extractor (PLIER) uses prior knowledge of pathways to extract biologically interpretable latent variables from large gene expression datasets.

Weiguang Mao, Elena Zaslavsky & Maria Chikina

Brief Communication | 28 March 2019

A cheminformatics approach to characterize metabolomes in stable-isotope-labeled organisms

A computational approach facilitates molecular formula, metabolite class, and structure assignment for plant metabolites on the basis of LC–MS analysis of fully 13C-labeled and unlabeled plants.

Hiroshi Tsugawa, Ryo Nakabayashi & Kazuki Saito

Brief Communication | 01 October 2018

Qiita: rapid, web-enabled microbiome meta-analysis

The Qiita web platform provides access to large amounts of public microbial multi-omic data and enables easy analysis and meta-analysis of standardized private and public data.

Antonio Gonzalez, Jose A. Navas-Molina & Rob Knight

This Month | 27 April 2018

Harris Wang

A regulatory vocabulary for synthetic biology and why baby diapers matter.

Vivien Marx

Resource | 19 March 2018

Metagenomic mining of regulatory elements enables programmable species-selective gene expression

Metagenomic mining generates a rich resource of regulatory sequences with species-selective and universal activity, making it possible to engineer synthetic circuits with tunable gene expression across diverse bacterial hosts.

Nathan I Johns, Antonio L C Gomes & Harris H Wang

Brief Communication | 08 January 2018

GIGGLE: a search engine for large-scale integrated genome analysis

GIGGLE is a genome interval search engine that enables extremely fast queries of genome features from thousands of genome annotation sets.

Ryan M Layer, Brent S Pedersen & Aaron R Quinlan

Brief Communication | 27 November 2017

Identifying metabolites by integrating metabolome databases with mass spectrometry cheminformatics

An integrated cheminformatics workflow aids the functional and structural annotation of unknown metabolites found across multiple biological systems.

, Hiroshi Tsugawa & Oliver Fiehn

This Month | 30 May 2017

Clustering finds patterns in data—whether they are there or not.

Naomi Altman & Martin Krzywinski

Resource | 27 June 2016

Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets

A newly developed algorithm enabled clustering of all 256 million (66 million identified and 190 million unidentified) peptide MS/MS spectra available in the PRIDE Archive database, allowing the detection of millions of consistently unidentified spectra across different data sets, of which roughly 20% could be identified using multiple complementary analysis tools.

Johannes Griss, Yasset Perez-Riverol & Juan Antonio Vizcaíno

Brief Communication | 16 May 2016

Automated mapping of phenotype space with single-cell data

X-shift software allows automated mapping of phenotypic space from large mass cytometry data sets. X-shift and the new representation algorithm Divisive Marker Tree provide a rapid, deterministic approach to navigating complex cellular systems.

Nikolay Samusik, Zinaida Good & Garry P Nolan

Tools in Brief | 29 September 2015

How good are those RNA-seq data?

Technology Feature | 30 July 2015

Human phenotyping on a population scale

Large-scale phenotyping is generating much data that geneticists can harness. Amid the excitement about the possibilities, there are some points of caution.

Correspondence | 30 July 2014

Proper reporting of predictor performance

Mauno Vihinen

Correspondence | 28 March 2014

A fair comparison

Paul I Costea, Georg Zeller & Peer Bork

Reply to: "A fair comparison"

Joseph N Paulson, Héctor Corrada Bravo & Mihai Pop

Brief Communication | 13 October 2013

DGIdb: mining the druggable genome

A database of known drug-gene interactions, with information derived from many public sources, allows the identification of genes that are currently targeted by a drug and the membership of genes in a category, such as kinase genes, that have a high potential for drug development.

Malachi Griffith, Obi L Griffith & Richard K Wilson

Brief Communication | 29 September 2013

Differential abundance analysis for microbial marker-gene surveys

The metagenomeSeq tool robustly detects the differential abundance of microbes in marker-based microbial surveys by tackling the problems of data sparsity and undersampling common to these data sets.

, O Colin Stine

Correspondence | 27 September 2013

ExpressionBlast: mining large, unstructured expression databases

Guy E Zinman, Shoshana Naiman & Ziv Bar-Joseph

Methods in Brief | 30 July 2013

Mining for lateral gene transfer

Correspondence | 01 February 2010

IntOGen: integration and data mining of multidimensional oncogenomic data

Gunes Gundem, Christian Perez-Llamas & Nuria Lopez-Bigas

COMMENTS

data mining Latest Research Papers
It serves as a real-time micro-blogging service for communication and opinion sharing. Twitter shares its data for research and study purposes by exposing open APIs, which makes it the most suitable source of data for social media analytics. Applying data mining and machine learning techniques to tweets is gaining more and more interest.
Data mining
Data mining is used in computational biology and bioinformatics to detect trends or patterns without knowledge of the meaning of the data. Latest Research and Reviews.
345193 PDFs
Explore the latest full-text research PDFs, articles, conference papers, preprints and more on DATA MINING. Find methods information, sources, references or conduct a literature review on DATA MINING
Data mining articles within Scientific Reports
Read the latest Research articles in Data mining from Scientific Reports. ... Data mining articles within Scientific Reports. ... Calls for Papers Editor's Choice ...
Data Mining Methods and Obstacles: A Comprehensive Analysis
Big data analytics: a literature review paper. In: Advances in Data Mining. Applications and Theoretical Aspects: 14th Industrial Conference, ICDM 2014, St. Petersburg ...
Home
Data Mining and Knowledge Discovery is a leading technical journal focusing on the extraction of information from vast databases. Publishes original research papers and practice in data mining and knowledge discovery. Provides surveys and tutorials of important areas and techniques. Offers detailed descriptions of significant applications.
Artificial intelligence-aided data mining of medical records for cancer
The application of artificial intelligence methods to electronic patient records paves the way for large-scale analysis of multimodal data. Such population-wide data describing deep phenotypes composed of thousands of features are now being leveraged to create data-driven algorithms, which in turn has led to improved methods for early cancer detection and screening. Remaining challenges ...