Over the past two weeks, there has been growing community engagement in the Australian innovation ecosystem. The specific one I’m referring to is a … “slack channel [that] was set up and is co-moderated by Dianna Sommerville, founder of the Regional Pitchfest and Community Manager for Bridge Hub. My (Chad Renando) interest is based on my various roles as director of Startup Status, Managing Director Australia with the Global Entrepreneurship Network, ESHIP Champion with the Kauffman Foundation, and working with QUT’s Australian Centre for Entrepreneurship Research and the Rural Economies Centre of Excellence at USQ.”
Being engaged from a few different angles, I’ve been working on the data itself, and this is a story about that data.
What is the problem?
Well, the criterion we are looking into the data for is the effectiveness of the community to collaborate and connect, or as Chad and I discussed in a recent episode on YouTube – “the increase in velocity” or “the reduction in friction”. We see this in the different ways we work and communicate as a community or ecosystem. The requirement that drives this is more about social network analysis, which is beyond the scope of this article.
However, if you are interested in this side of the discussion, have a read about some of this at Dr Johanna Nalau’s blog https://johannanalau.com/blog/, as she is helping us in this space.
What is the platform?
Off the back of my previous article, The tech behind the social good, where I explored using the Oracle Data Platform for working with the data, I started here again to look at what I could do with the data.
However, I went with a different approach. Previously, I could see the data very easily with my understanding of the data, which meant I didn’t need to spend time building the data model before getting started with the analysis. This time round, I needed to do a little more, and hence I focused on two parts:
- Building the logic to capture the data. To do this, I worked with the Slack APIs.
- Building the model that can be used for the analysis. For this bit, I primarily worked with Pandas, supported by my work with Spark and Oracle Analytics Desktop.
I’ll explain a little more about the process and the tools.
Working with the Slack APIs
This community was started as a Slack workspace to connect innovation ecosystem builders and leaders across Australia. Chad and I have had discussions about the data behind communities like this, where the value (of the community) is not the direct outputs or outcomes of the individuals, but how the community supports these outcomes directly and indirectly. Researching the analytics tools, most of them were about supplying 3rd-party analytics into Slack as notifications or messages, but not the other way around. So, this was not just an opportunity to support the community, but also an opportunity to learn about adapting analytics to Slack.
Slack has a REST API with a couple of different client SDKs – one in Python and one in NodeJS. In addition, there are three different methods to communicate with Slack: 1/ web-based (where you invoke the REST end-point via HTTPS), 2/ event-based (where you register an end-point to receive events) and 3/ WebSocket-based (where you open the socket and both invoke and receive events). I also got the feeling that the WebSocket one was being replaced by the web-based and event-based methods due to their more open nature. Based upon the versioning and the types of interactions, I settled on #1, the web-based APIs, as well as the Python implementation to align with my work with the Apache Spark and Pandas libraries.
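As a rough illustration of option #1, here is a minimal sketch using the Python SDK (the slack_sdk package). The token environment variable and the single call shown are placeholders to show the shape of a web-based invocation, not the exact code I used.

```python
# Minimal sketch: invoke a web-based Slack end-point via the Python SDK.
# SLACK_BOT_TOKEN is a placeholder for a bot token with the right scopes.
import os
from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

try:
    # conversations.list returns the channels visible to the token
    response = client.conversations_list(types="public_channel")
    for channel in response["channels"]:
        print(channel["id"], channel["name"])
except SlackApiError as e:
    print(f"Slack API error: {e.response['error']}")
```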
The APIs are well documented on the Slack API site, giving good information about the end-points, the OAuth security model and rate limiting based upon a tiered structure. I would suggest creating your own tests, as the documentation doesn’t really explain the data model, especially when you are looking to capture information from Slack.
Through this process, I was able to capture information about channels, users, files, messages (ie messages directed at channels), threads (ie messages replying to other messages) and reactions.
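To give a feel for what that capture looks like, here is a hedged sketch of paging through a channel’s history and its thread replies, backing off when the tiered rate limits respond with a 429. The channel id and page size are placeholders rather than the actual values from this project.

```python
# Sketch: page through a channel's messages and thread replies, respecting rate limits.
import os
import time
from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

def fetch_history(channel_id):
    """Yield every message in a channel, following the pagination cursor."""
    cursor = None
    while True:
        try:
            resp = client.conversations_history(channel=channel_id, cursor=cursor, limit=200)
        except SlackApiError as e:
            if e.response.status_code == 429:  # rate limited: wait as instructed, then retry
                time.sleep(int(e.response.headers.get("Retry-After", 30)))
                continue
            raise
        yield from resp["messages"]
        cursor = resp.get("response_metadata", {}).get("next_cursor")
        if not cursor:
            break

channel_id = "C0123456789"  # placeholder channel id
for msg in fetch_history(channel_id):
    if msg.get("reply_count"):  # this message is the parent of a thread
        thread = client.conversations_replies(channel=channel_id, ts=msg["thread_ts"])
        replies = thread["messages"][1:]  # the first entry is the parent itself
```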
The value of learning drives a different approach to the tools
Because we didn’t have a specific model to go with, this process was much more of an experiment. It has been going for about 2 weeks (since about 20th March), and it took about a week to settle on what the model looked like; up to that point, we were iterating on it. To simplify the process (as well as data transfer), we settled on a basic CSV format. From there, for each of the basic entities – channels, users, messages, threads, reactions – I imported them into Oracle Analytics Desktop to play around with the data. As a quick way of understanding the data, it was a good way of communicating what the data talks about with others, including Chad. And with the speed of implementation, I was able to demonstrate what we could do.
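As an illustration of that CSV step, the sketch below flattens captured message payloads into a table with Pandas. The file name and field names are illustrative rather than the exact columns in our extracts.

```python
# Sketch: flatten captured Slack message payloads into one of the per-entity CSVs.
import json
import pandas as pd

with open("messages.json") as f:  # raw payloads saved by the capture step (illustrative name)
    raw_messages = json.load(f)

messages = pd.DataFrame([
    {
        "channel_id": m.get("channel"),
        "user_id": m.get("user"),
        "ts": m.get("ts"),
        "thread_ts": m.get("thread_ts"),
        "reply_count": m.get("reply_count", 0),
        "reaction_count": sum(r["count"] for r in m.get("reactions", [])),
        "text": m.get("text", ""),
    }
    for m in raw_messages
])

messages.to_csv("messages.csv", index=False)  # ready to load into Oracle Analytics Desktop
```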
That being said, automation was something I needed to consider. As new data was captured and the reports were re-run, it was important to consider the time, effort and reliability of the process to present a continual refresh of the data to Chad and the broader team. Oracle Analytics Desktop was great to get things moving, and there are jobs that can be automated and sequenced, however in this scenario I needed a fully automated end-to-end process. At least by the time I needed to automate and build code to transform the data, I had a very good understanding of what I needed to build.
Tools – Spark, Pandas, pure Python or even something else like jsonb
With Python being the core language I used for this work, I had several different options at ready access, each something on my learning path to do more of – Apache Spark, Pandas and Python itself.
Each of these packages, frameworks and products can accomplish similar things, especially for data engineering. For the size of the data, and the fact that I’m working primarily with structured data sets, it was much of a muchness. Furthermore, since the process was automated, whether it took 30 seconds or 2 minutes was not a major concern – it was automated.
What I did find – especially comparing Pandas and Spark – was that the data manipulation was very similar. I recognise that the underlying engineering of the Spark platform is there to handle horizontal scale and process large amounts of data across a grid, however that was not my scenario; most of the data sets were less than a few MBs. In addition, Spark’s ability to handle different data sources and to integrate streams directly into the pipelines is a plus. (So … it all depends …)
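To show what “much of a muchness” looks like in practice, here is the same aggregation written in both Pandas and PySpark. The file and column names follow the illustrative CSV above, not our actual extracts.

```python
# Sketch: the same "messages per user per channel" aggregation in Pandas and PySpark.
import pandas as pd
from pyspark.sql import SparkSession

# Pandas version
pdf = pd.read_csv("messages.csv")
pandas_counts = (
    pdf.groupby(["channel_id", "user_id"])
       .size()
       .reset_index(name="message_count")
)

# PySpark version, running on a local session
spark = SparkSession.builder.master("local[*]").getOrCreate()
sdf = spark.read.csv("messages.csv", header=True, inferSchema=True)
spark_counts = (
    sdf.groupBy("channel_id", "user_id")
       .count()
       .withColumnRenamed("count", "message_count")
)
spark_counts.show()
```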
Through the experimentation and automation, we now have daily updates and extracts with the dataset and automated aggregations being generated.
New opportunities
After a quick discussion with a friend, Peter Laurie, we talked about some of my earlier tech heritage. Things that are more JSON or XML / XQuery focused were something I hadn’t considered, though they presented new opportunities. Things like jsonb and Postgres open up the opportunity to translate directly from the JSON format into a database. (Something to add to my bucket list.)
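I haven’t built this yet, but as a hedged sketch of the idea, something like the following would push the raw Slack payloads straight into a Postgres jsonb column and query fields back out of the JSON. The table name and connection string are assumptions for the example.

```python
# Sketch (untested idea): store raw Slack payloads in a Postgres jsonb column.
import psycopg2
from psycopg2.extras import Json

raw_messages = []  # the message payloads captured earlier from the Slack API

conn = psycopg2.connect("dbname=ausinnovation user=postgres")  # placeholder connection
with conn, conn.cursor() as cur:
    cur.execute(
        "CREATE TABLE IF NOT EXISTS slack_messages (id serial PRIMARY KEY, payload jsonb)"
    )
    for msg in raw_messages:
        cur.execute("INSERT INTO slack_messages (payload) VALUES (%s)", [Json(msg)])
    # Query a field out of the JSON without ever defining a relational model
    cur.execute("SELECT payload->>'user', count(*) FROM slack_messages GROUP BY 1")
    print(cur.fetchall())
```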
There are also extensions to the way we look at the data, and a couple of areas to explore.
- Graph and network analysis. An important characteristic of communities is word of mouth, or collisions. This is evident through the network of people and the connections we have. The value is driven primarily by the people in the community who support and push ideas through to execution. We can then start mapping out the relationships between people across the threads, and relate that to how it occurs in the real world – the collisions that we, at some stage, determine to be valuable through the experiences we bring as a community. (A first sketch follows after this list.)
- Text analysis. Another important aspect of the community is what we are communicating about, and that is captured primarily in the messages. So, for understanding the intent, the context and what drives value through the community, the text is a source of information to analyse (though not the only way to gain context and intent). We’ll be looking into this to see what we can see; I’ve already started that process in Oracle Analytics Desktop. (The sketch below includes a simple starting point for this too.)
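As a first hedged sketch of both ideas, the snippet below builds a directed reply graph (who replies to whom in threads) with networkx and does a crude word-frequency count over the message text. It uses the illustrative columns from the CSV sketch above, not a finished model.

```python
# Sketch: a reply graph between people, plus a crude word-frequency starting point.
import re
from collections import Counter

import networkx as nx
import pandas as pd

messages = pd.read_csv("messages.csv")

# Graph: an edge from the replier to the author of the thread they replied in
parents = messages.drop_duplicates("ts").set_index("ts")["user_id"]  # messages keyed by ts
replies = messages.dropna(subset=["thread_ts"])
G = nx.DiGraph()
for _, row in replies.iterrows():
    parent_author = parents.get(row["thread_ts"])
    if pd.notna(parent_author) and parent_author != row["user_id"]:
        G.add_edge(row["user_id"], parent_author)

# The most "connected" people in the conversation network
print(sorted(nx.degree_centrality(G).items(), key=lambda kv: -kv[1])[:10])

# Text: token counts as a naive starting point for intent and context analysis
tokens = Counter()
for text in messages["text"].dropna():
    tokens.update(re.findall(r"[a-z']+", str(text).lower()))
print(tokens.most_common(20))
```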
Where to from here
If you are interested in the impact that this (code) is making, check out this YouTube clip of Chad and myself talking through the “why” behind the doing with the data – https://www.youtube.com/watch?v=Daj6SXeSRYA.
If you want to read more about this community, have a look here https://www.linkedin.com/pulse/slack-community-australian-innovation-ecosystem-builders-chad-renando/, or to get involved and engaged, reach out to me (@jlowe000).
The code that I’ve used to call the Slack APIs, extract the data and create the analysis is available in my git repository (however, at the time of writing, it is still private). I’m happy to share what I have, so reach out to me and we can start the discussion.
(Update – I’ve made the repo public – https://github.com/jlowe000/ausinnovation)