This is a collection of resources (tools, software, literature, etc.) members of BiCDaS have found useful in their daily work. We also included a list of organizations Data Science enthusiasts may find appealing.
This list is by no means comprehensive and will be extended continually.
EuADS The European Association for Data Science was founded only recently and aims to foster cooperation and communication among data scientists in Europe.
GfKl The German Gesellschaft für Klassifikation (roughly, Society for Classification) celebrated its 40th anniversary in 2017. It has about 300 members and aims to promote data classification and signal processing. Data Science Society has recently been added as a second name.
DHd The German platform Digital Humanities im deutschsprachigen Raum (roughly, Digital Humanities in German-speaking countries) represents the interests of Digital Humanities researchers. Founded in 2013, it has about 400 members (as of 2021).
The Jupyter Notebook is a useful tool for data exploration. You can write code, plot the results and insert formatted text in any order, so that code, results and research notes sit side by side in one document. The Jupyter project originated in Python but can be configured for a wide array of languages. It is continually developed as an open-source project coordinated at the Berkeley Institute for Data Science (BIDS).
Python is a general-purpose programming language that has gained tremendous popularity in the Data Science community in the last few years. It offers high code readability, high expressiveness and a high-level command set. Its "batteries included" philosophy, together with a large ecosystem of open-source libraries, has added substantially to its popularity and utility.
Some modules (libraries) of particular use for data scientists are: numpy, scipy, scikit-learn, pandas and scikit-image.
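As a rough illustration of how two of these modules work together, the following minimal sketch groups a small table with pandas and then does vectorised arithmetic on one column with numpy. The column names and values are invented purely for illustration, and both libraries are assumed to be installed.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements, invented purely for illustration.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 2.0, 6.0],
})

# pandas: group the rows and compute the mean value per group.
group_means = df.groupby("group")["value"].mean()

# numpy: vectorised arithmetic on the underlying array (no explicit loop).
values = df["value"].to_numpy()
centered = values - np.mean(values)

print(group_means)
print(centered)
```

The same pattern (load into a DataFrame, aggregate with pandas, compute with numpy) carries over to much larger data sets.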
Apache Flink is a framework for distributed data processing. While it can process finite data sets (batch mode), it really shines in the processing of continuous data streams. It is a project of the Apache Software Foundation and is independent of the Apache web server. Its advantages include cluster mode (many hosts involved in processing) and fault tolerance.
Data Wrangling with Python by Jacqueline Kazil and Katharine Jarmul
An excellent introduction to Data Science in Python that is accessible to Python newbies without being an actual Python textbook. The authors focus just as much on methods of data acquisition, selection, preparation and storytelling as on the language and its various modules (libraries). They use real-world data from public databases for their examples, e.g., the WHO data repository, which makes learning a lot more fun. The data and code are available via Git.
ISBN-13: 978-1491948811
Data Science from Scratch by Joel Grus
For those already familiar with Python and those preferring a more method-centred approach, this book might be the better alternative. The author covers typical topics for data science beginners, such as correlation, regression and machine learning, and their implementation in Python. He uses the fictitious company DataSciencester as a running example, which gives the book a nice common thread.
ISBN-13: 978-1491901427
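To give a flavour of the method-centred topics mentioned for the book above, here is a minimal sketch of Pearson correlation and a simple least-squares line fit using only the standard library. It is not taken from the book; the function names and sample data are hypothetical and chosen for illustration only.

```python
from math import sqrt


def mean(xs):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(xs) / len(xs)


def correlation(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def least_squares_fit(xs, ys):
    """Slope and intercept of the line y = slope * x + intercept
    that minimises the squared error."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept


# Hypothetical data lying exactly on the line y = 2x.
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]

r = correlation(xs, ys)
slope, intercept = least_squares_fit(xs, ys)
print(r, slope, intercept)
```

In practice one would usually reach for numpy or scikit-learn for this; writing it out by hand mainly serves to make the underlying formulas visible.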
In the podcast Not So Standard Deviations, Roger Peng (Johns Hopkins Bloomberg School of Public Health), Hilary Parker (Stitch Fix) and occasional guests talk about Data Science, mixed with some real-life talk and the never-ceasing Python vs. R discussion. This podcast is entertainment and education at its finest.