Month: February 2018

The Paradise Papers: How the Cloud Helped Expose the Hidden Wealth of the Global Elite

In early 2016, the International Consortium of Investigative Journalists (ICIJ) published the Panama Papers, one of the biggest tax-related data leaks in recent history, involving 2.6 terabytes (TB) of information. It exposed the widespread use of offshore tax havens and shell companies by thousands of wealthy individuals and political officials, including the British and Icelandic Prime Ministers. Now […]

Talend vs. Spark Submit Configuration: What's the Difference?

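In my previous blog, "Talend and Apache Spark: A Technical Primer", I walked you through how Talend Spark jobs equate to Spark Submit. In this blog post, I want to continue evaluating Talend Spark configurations against Apache Spark Submit. First, we will look at how the options in the Apache Spark configuration tab of a Talend Spark job map to the arguments you can pass to Spark Submit, and discuss what they are used for. Let's get started. Command differences: when you run an Apache Spark job in your environment (such as one of the Apache Spark examples provided by default on a Hadoop cluster to verify that Spark is working as expected), you use the following commands:
export HADOOP_CONF_DIR=XXX
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client --executor-memory 5G --num-executors 10 /path/to/examples.jar 1000
The two commands highlighted above set our Spark […]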

How to Structure Your Business to Make Better Use of Data

A few years ago, Starbucks’ director of analytics and business intelligence, Joe LaCugna, said the Seattle coffee giant once struggled to make sense of the data pouring in from its loyalty card holders, who at the time numbered over 13 million and accounted for 36 percent of all Starbucks’ transactions. The same was true of […]

Net Neutrality: Why it’s Vital for Digital Transformation

Until a few months ago, it was thought that the issue of net neutrality had been definitively settled by the ruling of the Federal Communications Commission (FCC) in 2015; however, that all changed with the new Trump administration and statements by the new FCC chairman, just reappointed for 4 years by the US […]

CIOs: Three Considerations for Digital Transformation

Many businesses today are scrutinizing their operations to figure out how to join the digital transformation revolution. They understand that to become more competitive and customer-centric, they need processes that are flexible, integrated, insightful and scalable. They understand that harnessing data and infusing it into business processes is the key to success. Unfortunately, poor data […]

Talend Step-by-Step: Continuous Data Matching & Machine Learning with Microsoft Azure

Today, almost everyone has big data, machine learning and cloud at the top of their IT “to-do” list. The importance of these technologies can’t be overemphasized, as all three are opening up innovation, uncovering opportunities and optimizing businesses. Machine learning isn’t a brand new concept; simple machine learning algorithms actually date back to the 1950s, though […]


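It is well known that enterprise data needs keep changing, and lately the pace of change has been accelerating. Companies that once processed all of their big data on premises are suddenly moving to the cloud. Frameworks we once knew and loved are quickly becoming obsolete. Yet the hottest debate of the moment is still how to process data faster. Two data processing approaches have become popular recently: batch processing and stream processing. Batch processing is mostly used for non-continuous data. It excels at processing datasets quickly, but it does not really meet the real-time needs of most businesses today. Stream processing is mostly used for continuous data and is very good at turning big data into fast data. Both approaches have their pros and cons. Ultimately, the choice between batch and stream processing depends entirely on your business use case. Still, there are questions and use cases to consider when choosing a data processing approach. In the latest episode of Craft Beer and Data, Mark Balkenende and I dig into the batch vs. stream processing debate. We answer some interesting questions, such as "Is real-time data really possible?" We also discuss whether the lambda architecture is really obsolete, and sort through the factors to weigh when deciding between batch and stream processing. Before we get to the video, it is worth getting familiar with Craft Beer and Data! Check out our events page and join an event in your area. We would also love to hear your thoughts on the batch vs. stream processing debate. Tell me what you think on Twitter at @Nick_Piette.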