Date: Monday, June 17th 2024
Time: 12:15 p.m. (s.t.)
Location: Cafe Nordlicht
Research data are ubiquitous in the humanities, and researchers must be able to handle them confidently. This requires continuous engagement with digital skills, an area to which the digital humanities have been devoted for many years. With the growing spread of generative language models, the acquisition of knowledge has become not only faster but also more individualized. This talk addresses precisely this topic and presents ChatGPT as a powerful instrument for fostering digital literacy.
The talk begins with a conceptual introduction to digital literacy and highlights the essential role of digital texts in scholarly work. The project Text+, a consortium within the German National Research Data Infrastructure (NFDI), is also briefly introduced.
A closer look is taken at various aspects of the representation of digital texts, ranging from character encoding to the annotation of analysis results. In particular, the XML-based annotation framework of the Text Encoding Initiative (TEI) is examined in more detail.
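As a minimal illustration of what TEI-encoded text looks like and how it can be processed programmatically, the sketch below builds a tiny TEI document by hand and reads it with Python's standard library; the snippet and its contents are invented for this example and are not material from the course.

```python
import xml.etree.ElementTree as ET

# The TEI namespace below is the official one defined by the Text Encoding Initiative;
# the document content itself is a made-up minimal example.
TEI_NS = "http://www.tei-c.org/ns/1.0"
tei_snippet = f"""
<TEI xmlns="{TEI_NS}">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Beispieltext</title></titleStmt>
      <publicationStmt><p>Unpublished example</p></publicationStmt>
      <sourceDesc><p>Born digital</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Forschungsdaten sind in den <term>Geisteswissenschaften</term> allgegenwärtig.</p>
    </body>
  </text>
</TEI>
"""

root = ET.fromstring(tei_snippet)
ns = {"tei": TEI_NS}

# Extract the title from the header and the plain text of all body paragraphs.
title = root.find(".//tei:titleStmt/tei:title", ns).text
paragraphs = ["".join(p.itertext()) for p in root.findall(".//tei:body//tei:p", ns)]
print(title)
print(paragraphs)
```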
Beyond the representation of digital texts, the analysis and further processing of digital research resources is also addressed. Programming skills are indispensable for this and therefore need to be acquired as well. Particular attention is paid to the programming language Python and the open-source library spaCy, which serves as a powerful tool for natural language processing.
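As a brief illustration of the kind of processing spaCy enables, the following sketch tokenizes and tags a German sentence; it assumes the small German pipeline de_core_news_sm has been installed separately and is not taken from the course itself.

```python
import spacy

# Assumes the small German pipeline has been installed beforehand:
#   python -m spacy download de_core_news_sm
nlp = spacy.load("de_core_news_sm")

doc = nlp("Forschungsdaten sind in den Geisteswissenschaften allgegenwärtig.")

# Tokenization, lemmatization and part-of-speech tagging
for token in doc:
    print(token.text, token.lemma_, token.pos_)

# Named entities recognized by the pipeline (may be empty for this sentence)
for ent in doc.ents:
    print(ent.text, ent.label_)
```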
This talk is based on a course for students of German studies at the University of Mannheim held in the spring semester of 2024. The experiences and insights from this course, "Digital Literacy mit ChatGPT", are not only applicable to future courses but can also be transferred to similar courses in other disciplines, in order to foster a broader understanding of digital competence and to facilitate the integration of digital tools into academic research.
Date: Friday, December 13th 2019
Time: 11:00 a.m. (s.t.)
Location: C2-136
Knowing the travel time for a given route is important for logistics applications, but also for individual travel, as documented by the predictions provided, for example, in Google Maps. In both applications we want to know not only the expected travel time but also the associated uncertainty, as being late typically incurs a larger penalty than being too early.
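One common way to capture this asymmetry, sketched below with invented numbers rather than material from the talk, is to predict an upper quantile of the travel time and evaluate it with the pinball (quantile) loss, which penalizes under-prediction more heavily than over-prediction.

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Quantile (pinball) loss: under-prediction (arriving late) is
    penalised tau/(1-tau) times more heavily than over-prediction."""
    diff = y_true - y_pred
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

# Hypothetical travel times in minutes and two candidate predictions
actual = np.array([32.0, 41.0, 38.0, 55.0])
cautious = np.array([40.0, 48.0, 45.0, 60.0])    # tends to over-predict
optimistic = np.array([28.0, 35.0, 33.0, 45.0])  # tends to under-predict

tau = 0.9  # we care mostly about not being late
print(pinball_loss(actual, cautious, tau))    # small penalty
print(pinball_loss(actual, optimistic, tau))  # much larger penalty
```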
For predicting travel times, a wide range of time series methods is applied to a very diverse landscape of data sources, each with its own strengths and problems. The methods include numerous time series analysis techniques dealing with univariate and multivariate data sets as well as linear and non-linear models.
In this talk I will describe how some of the underlying problems are solved using insights from domain knowledge and statistical data analysis methods. The main theme is that brute-force, purely data-driven modelling does not work, and purely theory-driven modelling is typically not sufficient either. It is the combination of these two approaches that leads to success.
I will also hint at some of the current challenges, both technological and institutional. This will lead to my answers to the question: Are we there yet, can we provide reliable travel time predictions?
Date: Friday, January 10th 2020
Time: 11:00 a.m. (s.t.)
Location: C2-136
A fundamental function of cultural heritage projects and institutions is the collection, cataloguing and persistent storage of material of human culture. Challenges emerge with the growing need to digitize such material and to provide it to research communities or the general public. A large number of cultural heritage collections are stored in depots and might never reach a digitally published state, mainly due to the necessary selectivity of digitization projects and the amount and diversity of artifacts being acquired and collected. In this respect, the openness, interoperability and sustainability of repositories are of particular importance, as these requirements form a foundation for the long-term preservation of digital cultural heritage along with the scholarly discourse.
In this talk, we discuss the value and benefits that an adoption of the FAIR principles [footnote: findable, accessible, interoperable, reusable; see: https://www.go-fair.org/fair-principles] has for research data management in the digital humanities. After a brief look at current perspectives and requirements of funding organizations with regard to the FAIR principles and their adoption, we present results of the Digital Research Infrastructure for the Arts and Humanities (DARIAH). The BMBF-funded initiative has developed strategies and services that facilitate the introduction of the FAIR principles to collections of cultural heritage. In our talk, we particularly focus on methods that allow the explication of contextual knowledge about existing collections to achieve interpretability of their content in integrative contexts and hence support such collections in becoming FAIR.
Date: Friday, January 17th 2020
Time: 11:00 a.m. (s.t.)
Location: C2-136
Sensors embedded in mobile phones allow for analyzing contextual factors, such as a consumer's current outdoor location, and for using them for targeting purposes, which has been shown to improve customer responses to mobile promotions. However, many customers also use their phones inside stores, which provides opportunities for better understanding in-store behavior and for location-based targeting during the actual shopping process. We present the results of a unique large-scale field test, which we conducted in cooperation with a fashion retailer. Specifically, we developed and employed a mobile application that is capable of tracking customers in the store and of delivering individualized promotion messages. We use the application to examine the effects of in-store behavior on purchases, and to conduct a between-subjects experiment that tests the effects of different location-based promotion types.
There is, unfortunately, no recording of this lecture available. We apologise.
Date: Friday, January 24th 2020
Time: 11:00 a.m. (s.t.)
Location: C2-136
The free flow of information is the lifeblood of digital industries and much of modern science. Yet collaboration is impeded by inefficiently low data sharing, as proprietary data is kept in private silos. Often, data exchange is prevented by competition and a lack of trust between data owners, consumer and privacy concerns, and data protection regulation. Is there a way to reconcile digital collaboration with data privacy?
Secure Multiparty Computation (SMPC), a disruptive technology, promises to do just that. It simulates a virtual trusted third party as a cryptographic network between the parties. The network only exists as long as all parties actively engage in the calculation, and the joint result is distributed to all. The individual private data, however, remains with the original data owners, and nothing can be learned by an external or internal attacker except for the intended result of the computation.
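As a toy illustration of this idea, and not the protocol used in the proof-of-principle mentioned below, the following sketch uses additive secret sharing: each party splits its private value into random shares, only shares are exchanged, and only the combined result, here a joint patient count with made-up numbers, is revealed.

```python
import secrets

PRIME = 2_147_483_647  # public modulus; all arithmetic is done mod PRIME

def share(secret: int, n_parties: int) -> list[int]:
    """Split a secret into n additive shares that sum to the secret mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

# Two hospitals each hold a private patient count (invented values).
private_inputs = {"hospital_A": 120, "hospital_B": 85}

# Each party splits its input and sends one share to every other party.
all_shares = {name: share(value, 2) for name, value in private_inputs.items()}

# Each party locally adds the shares it received ...
partial_sums = [
    (all_shares["hospital_A"][i] + all_shares["hospital_B"][i]) % PRIME
    for i in range(2)
]

# ... and only the combined result reveals the joint total, not the inputs.
joint_total = sum(partial_sums) % PRIME
print(joint_total)  # 205
```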
In many ways, secure multiparty computation is similar to blockchain technology. Both work in trustless settings without a central authority. However, while distributed ledger technologies provide trust, reliability, and transparency, secure multiparty computation provides privacy, security, dynamic consent, and control over proprietary data.
This talk provides a brief non-technical introduction to secure multiparty computation. A proof-of-principle is presented in which proprietary patient data was jointly evaluated between two remote university hospitals.
Date: Friday, January 31st 2020
Time: 11:00 a.m. (s.t.)
Location: C2-136
From 2015 to 2019, the Faculty of Linguistics and Literary Studies of Bielefeld University carried out the DFG project Kinder und Jugendliteratur im Medienverbund 1933-1945. The University Library developed the technical infrastructure for data acquisition, data processing and data visualisation. Based on a complex data model for describing different types of media (especially films, radio broadcasts and print editions, but also theatre performances, records, television broadcasts and advertising material) as manifestations of literary material, the various linkages have to be worked out and made visible through visualization techniques. The lecture introduces the approaches and methods used and their implementation in detail, and also reports on the use of these techniques in other digital services of the University Library. This includes, in particular, analyses of the global publication network in the context of bibliometric information.
Date: February 14th, 2020
Time: 11:00 a.m. (s.t.)
Location: C2-136
At present, R and Python are the most widely used programming languages for data science. R, based on the earlier language S, which was developed at Bell Labs in the 1980s, is by design intended for data analysis and graphics applications.
Python, developed in the 1990s for general applications, has seen widespread adoption in the scientific computing and data science communities. Both are dynamic languages that can be used interactively in a REPL (read-eval-print loop), allowing for rapid prototyping of algorithms and on-the-fly analyses. Both languages also have well-developed infrastructure, including integrated development environments and thousands of user-contributed packages available in repositories. Both languages are open source and freely available.
As a member of the core development team for R, I have considerable experience using R and developing packages for it. I have also used Python, though not to the same extent. Despite a long history with these languages, I recently switched my development efforts to Julia, primarily for flexibility.
I will compare and contrast these languages and explain why I choose to use the less mature but more flexible one for data science.
Prof. Dr. Christiane Fuchs
Data Science
Bielefeld University
Date: Friday, October 19th
Time: 11:00 a.m. (s.t.)
Location: G2-104 (CeBiTec)
Data science refers to the acquisition of knowledge from data. The field is increasingly important as more and more data is being collected in a growing number of areas, and it has great potential for optimizing processes. One of these areas is medicine. Medical data is growing in volume and complexity, for example through the accelerated digitization of patient data or improved high-throughput technologies in molecular biology. Processing such data requires both statistical and computational expertise as well as medical domain knowledge for correct interpretation.
I will present projects in which interdisciplinary cooperation between statistics/data science and medicine/biology led to a gain in knowledge and thus to improved diagnostics. On the one hand, from the field of risk prediction for prostate cancer patients and childhood asthma: here we were able to use and combine machine learning techniques in such a way that the available complex data were used more effectively, allowing us to improve previously used forecasting methods. On the other hand, for the detection of transcriptional heterogeneity in leukemia patients: here we use an RNA measurement method that sequences small numbers of cells rather than single cells, in order to reduce cost, effort, unwanted effects of cell isolation and, most importantly, technical errors. In the case of heterogeneous cell populations, the resulting data is a tangle of signals; a statistical algorithm we developed can extract the single-cell information again.
Unfortunately there is no recording of this lecture available. Our apologies!
Prof. Dr. Philipp Cimiano
Semantic Computing
Bielefeld University
Date: Friday, December 14th
Time: 11:00 a.m. (s.t.)
Location: C2-136
In many knowledge-intensive areas, experts are overwhelmed with information. As the number of publications in many fields is growing at an exponential rate, it is becoming harder and harder for experts to keep up and distill the "evidence" or knowledge from the available publications. New approaches to structuring and managing evidence to support insight generation and the answering of specific questions are crucial. Machines can support the task of structuring the available evidence, but there are a number of challenges to face. First, knowledge is published in unstructured form, so we need to teach machines to extract the relevant insights from published articles. Second, we need ontologies or knowledge representation approaches to represent this knowledge and support cross-document aggregation, since the relevant answer or insight to a question is rarely contained in a single document.
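To make the second point concrete, the following hedged sketch uses the Python library rdflib to represent findings from two fictitious papers as triples and aggregate them with a single query; the vocabulary and study identifiers are invented for illustration and are not the project's actual ontology.

```python
from rdflib import Graph, Literal, Namespace

# A made-up mini-vocabulary; a real system would rely on curated ontologies.
EX = Namespace("http://example.org/evidence/")

g = Graph()

# Two findings extracted from two different (fictitious) publications.
g.add((EX.study1, EX.therapy, Literal("Therapy A")))
g.add((EX.study1, EX.outcome, Literal("improved motor function")))
g.add((EX.study2, EX.therapy, Literal("Therapy A")))
g.add((EX.study2, EX.outcome, Literal("no significant effect")))

# Cross-document aggregation: collect every reported outcome for Therapy A.
query = """
SELECT ?study ?outcome WHERE {
  ?study <http://example.org/evidence/therapy> "Therapy A" .
  ?study <http://example.org/evidence/outcome> ?outcome .
}
"""
for study, outcome in g.query(query):
    print(study, outcome)
```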
We present results of the BMBF-funded project PSINK, which seeks to develop a novel approach to systematizing evidence in the field of medicine, in particular in the area dealing with the treatment of spinal cord injuries, for which no successful treatment exists to date. The project builds on natural language processing techniques and semantic technologies to develop a knowledge base that should eventually contain all the pre-clinical knowledge available on the efficacy of spinal cord injury therapies. This will support a novel concept that we call "meta-analysis on demand".
Unfortunately no slides are available for this lecture. Our apologies!
Dr. Alexander Sczyrba
Computational Metagenomics
Bielefeld University
Date: Friday, January 18th
Time: 11:00 a.m. (s.t.)
Location: C2-136
Microorganisms are the most abundant cellular life forms on Earth, occupying even the most extreme environments. The large majority of these organisms have not been obtained in pure culture and have therefore long been inaccessible to genome sequencing, which would provide blueprints for the evolutionary and functional diversity that shapes our biosphere. Since the advent of next-generation sequencing technologies, metagenomics and single-cell genomics can shed light on the uncharted branches of the tree of life. While these are very complementary approaches, both can recover microbial genomes from environmental samples, each with its own strengths and weaknesses.
More than 100,000 metagenomic datasets, hundreds of terabytes in size, are currently available in public data repositories and can be mined for new representatives of candidate phyla to obtain genomes of the underrepresented branches of the tree of life. These metagenomic datasets are also invaluable in bioprospecting, an approach for screening environmental samples for molecules and activities with biotechnological potential. However, for many small research labs these data remain inaccessible due to the lack of computational resources.
Cloud computing offers a solution, as it provides compute and storage capacities at scale. The CeBiTec at Bielefeld University is operating an OpenStack-based cloud computing infrastructure for the life science community within the German Network for Bioinformatics Infrastructure (de.NBI). The de.NBI Cloud (https://cloud.denbi.de/) is a full academic cloud federation, providing compute and storage resources free of charge for academic users. It provides a powerful IT infrastructure in combination with flexible bioinformatics workflows and analysis tools to the life science community in Germany. In my presentation I will show, using metagenomics research as one example, how the de.NBI Cloud can close this gap in computational resources.
Unfortunately no slides are available for this lecture. Our apologies!
Prof. Dr. Karsten Lübke
Mathematical Economics and Statistics
FOM University of Applied Science
Date: Friday, January 25th
Time: 11:00 a.m. (s.t.)
Location: G2-104 (CeBiTec)
Statistical thinking and computational skills, together with real-world applications, are regarded as fundamental elements of data literacy education. Data modeling and simulation-based inference may facilitate conceptual understanding in all domains of data literacy. This can be achieved by re-thinking the consensus-based curriculum. It is time to start, and for a first review of the lessons learned.
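As one hedged example of what simulation-based inference can look like in the classroom, and not an excerpt from the curriculum discussed in the talk, the following sketch approximates a p-value for a group comparison by shuffling labels, using invented exam scores.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical exam scores for two teaching formats
group_a = np.array([72, 85, 90, 68, 77, 81])
group_b = np.array([65, 70, 74, 60, 79, 66])
observed = group_a.mean() - group_b.mean()

# Simulation-based inference: shuffle the group labels many times and
# see how often chance alone produces a difference at least this large.
pooled = np.concatenate([group_a, group_b])
count = 0
n_sim = 10_000
for _ in range(n_sim):
    rng.shuffle(pooled)
    diff = pooled[: len(group_a)].mean() - pooled[len(group_a):].mean()
    if diff >= observed:
        count += 1

print(f"approximate p-value: {count / n_sim:.3f}")
```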
Prof. Dr. Claus Weihs
Computer-Supported Statistics
TU Dortmund
Date: Wednesday, January 30th
Time: 10:00 a.m. (s.t.)
Location: C2-136
In this talk we structure the field of data science and substantiate our key premise that statistics is one of the most important disciplines in data science and the most important discipline for analyzing and quantifying uncertainty. As an application, the talk demonstrates data science methods on music data for automatic transcription and automatic genre determination, both on the basis of signal-based features from audio recordings of music pieces.
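To give a rough idea of what "signal-based features" means here, the sketch below computes a few such features with the Python library librosa on a synthetic tone; it is an illustrative assumption, not the tooling used in the talk.

```python
import numpy as np
import librosa

# Generate a 3-second synthetic tone instead of loading a real recording,
# so the sketch runs without any audio file.
sr = 22050
t = np.linspace(0, 3.0, 3 * sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 440.0 * t)  # 440 Hz sine as a stand-in for music audio

# Signal-based features of the kind used for genre classification / transcription:
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # timbre description
chroma = librosa.feature.chroma_stft(y=y, sr=sr)          # pitch-class energies
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # "brightness" of the sound

print(mfcc.shape, chroma.shape, centroid.shape)
```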
Literature:
Claus Weihs and Katja Ickstadt (2018): Data Science: The Impact of Statistics; International Journal of Data Science and Analytics 6, 189-194
Claus Weihs, Dietmar Jannach, Igor Vatolkin and Günter Rudolph (2017): Music Data Analysis: Foundations and Applications; CRC Press, Taylor & Francis, 675 pages
Dr. Jan Goebel
German Socio-Economic Panel Study
German Institute for Economic Research
Date: Wednesday, May 22nd
Time: 10:15 a.m.
Location: X-E0-220
Nearly 15,000 households and about 30,000 persons participate in the SOEP survey. The SOEP provides a broad set of self-reported "objective" variables, such as income, age, gender, education, employment status, or grip strength, and a broad set of self-reported "subjective" variables, ranging from satisfaction with life and perceptions of fairness and reciprocity to psychological measures such as the "Big Five."
Having run for 35 years already, the SOEP gathers information from a spectrum of birth cohorts. As such, it is a valuable empirical basis for researchers to explore long-term societal changes; relationships between early life events and later life outcomes; interdependencies between the individual and the family or household; mechanisms of inter-generational mobility and transmission; accumulation processes of resources; short- and long-term effects of institutional change and policy reforms; and the speed of convergence between East and West or between migrants and natives.
The talk will give an overview of the basic features of the SOEP, from the basic sampling strategy to the structure of the released data, and explain how external users can access the data and in which ways the SOEP data can be enriched using auxiliary datasets such as geocoded data. Additionally, it will give an overview of the SOEP Innovation Sample (SOEP-IS) and how external researchers can submit proposals. SOEP-IS can accommodate not only short-term experiments but also longer-term survey modules that are not suitable for SOEP-Core, whether because the survey instruments are still relatively new or because of the specific issues dealt with in the research.
Dr. Silke Schwandt
Medieval History
Bielefeld University
In the 2017/18 winter term BiCDaS offered a lecture series on selected Data Science topics. Six different experts delivered presentations that provided insight into their research in various fields of data science.
Although the lecture series is over, you can still watch most talks below. It is a great collection of resources and showcases the wide range of topics in Data Science.
Prof. Dr. Johannes Blömer
Codes and Cryptography
University of Paderborn
Dr. Thomas Hermann
Ambient Intelligence
Bielefeld University
Unfortunately there is no recording of this lecture available. Our apologies!
Dr. Odile Sauzet
Statistical Consulting Center
Bielefeld University
Prof. Dr. Achim Streit
Steinbuch Centre for Computing
Karlsruhe Institute of Technology