Jiang Bian


2023

pdf bib
On the Impact of Cross-Domain Data on German Language Models
Amin Dada | Aokun Chen | Cheng Peng | Kaleb Smith | Ahmad Idrissi-Yaghir | Constantin Seibold | Jianning Li | Lars Heiliger | Christoph Friedrich | Daniel Truhn | Jan Egger | Jiang Bian | Jens Kleesiek | Yonghui Wu
Findings of the Association for Computational Linguistics: EMNLP 2023

Traditionally, large language models have been either trained on general web crawls or domain-specific data. However, recent successes of generative large language models, have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. Through training a series of models ranging between 122M and 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements up to 4.45% over the previous state-of-the-art.

pdf bib
MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models
Dingyao Yu | Kaitao Song | Peiling Lu | Tianyu He | Xu Tan | Wei Ye | Shikun Zhang | Jiang Bian
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

AI-empowered music processing is a diverse feld that encompasses dozens of tasks, ranging from generation tasks (e.g., timbre synthesis) to comprehension tasks (e.g., music classifcation). For developers and amateurs, it is very diffcult to grasp all of these task to satisfy their requirements in music processing, especially considering the huge differences in the representations of music data and the model applicability across platforms among various tasks. Consequently, it is necessary to build a system to organize and integrate these tasks, and thus help practitioners to automatically analyze their demand and call suitable tools as solutions to fulfill their requirements. Inspired by the recent success of large language models (LLMs) in task automation, we develop a system, named MusicAgent, which integrates numerous music-related tools and an autonomous workflow to address user requirements. More specifically, we build 1) toolset that collects tools from diverse sources, including Hugging Face, GitHub, and Web API, etc. 2) an autonomous workflow empowered by LLMs (e.g., ChatGPT) to organize these tools and automatically decompose user requests into multiple sub-tasks and invoke corresponding music tools. The primary goal of this system is to free users from the intricacies of AI-music tools, enabling them to concentrate on the creative aspect. By granting users the freedom to effortlessly combine tools, the system offers a seamless and enriching music experience. The code is available on GitHub along with a brief instructional video.

2022

pdf bib
KGE-CL: Contrastive Learning of Tensor Decomposition Based Knowledge Graph Embeddings
Zhiping Luo | Wentao Xu | Weiqing Liu | Jiang Bian | Jian Yin | Tie-Yan Liu
Proceedings of the 29th International Conference on Computational Linguistics

Learning the embeddings of knowledge graphs (KG) is vital in artificial intelligence, and can benefit various downstream applications, such as recommendation and question answering. In recent years, many research efforts have been proposed for knowledge graph embedding (KGE). However, most previous KGE methods ignore the semantic similarity between the related entities and entity-relation couples in different triples since they separately optimize each triple with the scoring function. To address this problem, we propose a simple yet efficient contrastive learning framework for tensor decomposition based (TDB) KGE, which can shorten the semantic distance of the related entities and entity-relation couples in different triples and thus improve the performance of KGE. We evaluate our proposed method on three standard KGE datasets: WN18RR, FB15k-237 and YAGO3-10. Our method can yield some new state-of-the-art results, achieving 51.2% MRR, 46.8% Hits@1 on the WN18RR dataset, 37.8% MRR, 28.6% Hits@1 on FB15k-237 dataset, and 59.1% MRR, 51.8% Hits@1 on the YAGO3-10 dataset.

2021

pdf bib
Revisiting the Evaluation of End-to-end Event Extraction
Shun Zheng | Wei Cao | Wei Xu | Jiang Bian
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2019

pdf bib
Doc2EDAG: An End-to-End Document-level Framework for Chinese Financial Event Extraction
Shun Zheng | Wei Cao | Wei Xu | Jiang Bian
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Most existing event extraction (EE) methods merely extract event arguments within the sentence scope. However, such sentence-level EE methods struggle to handle soaring amounts of documents from emerging applications, such as finance, legislation, health, etc., where event arguments always scatter across different sentences, and even multiple such event mentions frequently co-exist in the same document. To address these challenges, we propose a novel end-to-end model, Doc2EDAG, which can generate an entity-based directed acyclic graph to fulfill the document-level EE (DEE) effectively. Moreover, we reformalize a DEE task with the no-trigger-words design to ease the document-level event labeling. To demonstrate the effectiveness of Doc2EDAG, we build a large-scale real-world dataset consisting of Chinese financial announcements with the challenges mentioned above. Extensive experiments with comprehensive analyses illustrate the superiority of Doc2EDAG over state-of-the-art methods. Data and codes can be found at https://github.com/dolphin-zs/Doc2EDAG.

2016

pdf bib
An Analysis of WordNet’s Coverage of Gender Identity Using Twitter and The National Transgender Discrimination Survey
Amanda Hicks | Michael Rutherford | Christiane Fellbaum | Jiang Bian
Proceedings of the 8th Global WordNet Conference (GWC)

While gender identities in the Western world are typically regarded as binary, our previous work (Hicks et al., 2015) shows that there is more lexical variety of gender identity and the way people identify their gender. There is also a growing need to lexically represent this variety of gender identities. In our previous work, we developed a set of tools and approaches for analyzing Twitter data as a basis for generating hypotheses on language used to identify gender and discuss gender-related issues across geographic regions and population groups in the U.S.A. In this paper we analyze the coverage and relative frequency of the word forms in our Twitter analysis with respect to the National Transgender Discrimination Survey data set, one of the most comprehensive data sets on transgender, gender non-conforming, and gender variant people in the U.S.A. We then analyze the coverage of WordNet, a widely used lexical database, with respect to these identities and discuss some key considerations and next steps for adding gender identity words and their meanings to WordNet.

pdf bib
Solving Verbal Questions in IQ Test by Knowledge-Powered Word Embedding
Huazheng Wang | Fei Tian | Bin Gao | Chengjieren Zhu | Jiang Bian | Tie-Yan Liu
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

2014

pdf bib
Co-learning of Word Representations and Morpheme Representations
Siyu Qiu | Qing Cui | Jiang Bian | Bin Gao | Tie-Yan Liu
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

pdf bib
A Probabilistic Model for Learning Multi-Prototype Word Embeddings
Fei Tian | Hanjun Dai | Jiang Bian | Bin Gao | Rui Zhang | Enhong Chen | Tie-Yan Liu
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers