
This is going to be a very short post about a simple solution to the problem of inserting data fast when you really have a lot of it.

The problem

For the sake of having some example to think about, imagine building an app for managing nests of ants.

You can have thousands of nests with hundreds of thousands of ants in each one of them.

To make the fun example fit this blog post, imagine that you’re reading data files coming from a miraculous device that “scans” nests for ants and gives you detailed info about every ant. Creating a nest then means providing a name, coordinates, and the data file; the result should be a new nest and hundreds of thousands of ant records in the database.

How do we insert this much data without hitting the browser’s timeout?

Approaching the problem

Regular INSERT statements provide a lot of flexibility, which is normally much needed, but they are relatively slow. For this reason database experts avoid issuing many of them when pre-populating a database.
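
For contrast, here is a rough sketch of the naive version in Rails: one INSERT per record issued through ActiveRecord, using the hypothetical Ant model and AntImporter that also appear in the solution further below:

# the naive import: one INSERT statement per ant
# (note: a real Ant model would need `self.inheritance_column = nil`,
# since Rails reserves a column named "type" for single table inheritance)
ants = AntImporter.run!(data_file)

ants.each do |ant|
  Ant.create!(id: ant.id, type: ant.type, age: ant.age)
end

With hundreds of thousands of ants this means hundreds of thousands of round trips to the database, which is exactly what takes too long.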

The approach typically used instead (apart from the case in which a database needs to be restored, where pg_restore has no contenders in terms of speed) is the bulk data-loading command called COPY.

It allows you to provide data in CSV format either from a file on the database server or by streaming it from the client over the connection (COPY ... FROM STDIN). Because it’s almost never a good idea to connect to the database from Rails as a superuser, the first option isn’t available (reading files on the server’s file system is restricted to privileged accounts). Fortunately, there’s the second option, which we are going to make use of.

The solution

Here’s a short code excerpt showing how the above-mentioned approach could be used in Rails for the fun little app described at the beginning:

# Array#to_csv (used below) comes from the csv standard library:
require "csv"

# first, grab the underlying connection object coming
# from the lower-level postgres library (the pg gem):
connection = Ant.connection.raw_connection

# generate the ants array based on the data file:
ants = AntImporter.run!(data_file)

# now use the connection’s ability to start streaming
# data via the COPY feature:
connection.copy_data "COPY ants (id, type, age) FROM STDIN CSV" do
  ants.each do |ant|
    connection.put_copy_data [ ant.id, ant.type, ant.age ].to_csv
  end
end

If you’re curious, check for yourself: this way you can copy a really large amount of data without having to wait too long for the process to finish.

Please do make sure you read and consider the “Caveats” section of the “COPY” page on the Postgres wiki. There are reasons for the (slower) inner workings of the INSERT statement, so use this post’s solution with care.

I got a chance to attend the annual PyCon APAC 2017 (Python Conference, Asia Pacific), which was hosted in my homeland, Malaysia. In previous years, Python conferences in Malaysia were held at the national level; this year Malaysia’s PyCon committee worked hard to organize a broader Asia-level regional conference.

Highlights from Day 1

The first day of the conference began with a keynote delivered by Luis Miguel Sanchez, the founder of SGX Analytics, a New York City-based data science/data strategy advisory firm. Luis shared thoughts on many aspects of the advancement of artificial intelligence and machine learning, including demonstrations of automated music generation. In his talk he presented an application that composed a song using his AI algorithm, and he also told us a bit about the legal aspects of music produced that way.

Luis Miguel Sanchez speaking

Luis speaking to the audience. Photo from PyCon’s Flickr.

Then I attended Amir Othman’s talk, which discussed mining news data in the Malay and German languages (he received his education at a German tertiary institution). He covered verifying the source of the news and the structure of German and Malay, both of which have similarities with English. Amir first mentioned language detection using pycld2, then shared the backend setup for his news crawler, which uses RSS and Twitter feeds for input, Redis as a message queue, and spaCy and Polyglot for “entity recognition”.

Quite a number of speakers talked about gensim, including Amir, who used it for “topic modelling”. Amir also used TF/IDF (term frequency–inverse document frequency), a numerical statistic intended to reflect how significant a word is to a document in a corpus. For similarity lookups he used word2vec over the entire corpus, and for full-text search he used Elasticsearch.

Later I attended Mr. Ng Swee Meng’s talk, in which he shared his effort in the Sinar Project to process the Malaysian government’s publicly available documents with Python. He described characterizing documents with a bag-of-words model plus stopword removal, much as is done for English. Since his work focuses on Malay-language documents, he found that the already available Indonesian stopword lists could be adapted for Malay. Ng also mentioned using gensim in his work.

Highlights from Day 2

The second day began with a keynote from Jessica McKellar, who was involved in the development of Ksplice, Zulip (as a co-founder), and Dropbox. She highlighted her work with San Quentin prison helping inmates prepare for real-world opportunities after they get out. Jessica also discussed gender and racial diversity in computing and the accessibility of technical devices. She pointed out that the problem of getting more people involved in computing is not a lack of interest but a lack of access. She also praised PyCon UK’s efforts to help visually impaired attendees take part in the conference, and noted that a conference should be wheelchair friendly too, if possible.

Me standing in the audience

Me standing in the audience. Photo from PyCon’s Flickr.

I found the talk by Praveen Patil entitled “Physics and Math with Python” really interesting. Praveen showed his efforts to make teaching physics and mathematics interesting for students. Apart from code snippets, he also showed the electronic gadgets he uses for those subjects.

Another talk I attended was delivered by Hironori Sekine on the technologies used by startups in Japan. Hironori mentioned that Ruby is widely used by Japanese startups and that many Ruby books have been published in Japanese. Other programming languages in use include Java, PHP, Scala, and Go. Python has been gaining popularity since last year, as books in Japanese have started to appear.

Conclusion

Overall I really appreciate the organizers’ efforts. Though it was the first APAC-level PyCon held in Malaysia, I felt that it was very well organized and I could not complain about anything. Thumbs up for the effort, and hopefully I can attend next year’s event!

The Elastic stack is a nice toolkit for collecting, transporting, transforming, aggregating, searching, and reporting on log data from many sources. It was formerly known as the ELK stack, after its main components Elasticsearch, Logstash, and Kibana, but with the addition of Beats and other tools, the company now calls it simply the Elastic stack.

We are using it in a common configuration: a central log server receives logs via rsyslog over TLS, stores them in local files, and processes them further with Logstash.

When conservation is recommended

When forwarding logs on to SaaS log services such as Logentries, SumoLogic, etc., we have a limited amount of data transfer and storage allotted to us. So we need to either economize on what we send them, pay for a more expensive plan, or retain a shorter period of history.

For some very busy logs (nginx logs in JSON format) we decided to delete fields with empty values from the log event during the filter phase in Logstash. This removes a lot of data from the log message we send to the log service over the wire, and reduces the size of each log event stored in their system.

I expected this to be simple, but that expectation sometimes proves to be false. :)

Trying the prune filter

The most obvious way would be to use the Logstash prune filter, which is designed for just such a use case. However, the prune filter doesn’t handle nested keys, as explained in the documentation:

NOTE: This filter currently only support operations on top-level fields, i.e. whitelisting and blacklisting of subfields based on name or value does not work.

That is too bad.

Pruning with custom Ruby code

Several people have posted alternative solutions to this in the past. A representative recipe to have Logstash delete empty fields looked like this:

# This doesn’t work in Logstash 5 and newer ...
filter {
  ruby {
    code => "event.to_hash.delete_if {|field, value| value == '' }"
  }
}

And sadly, it doesn’t work.

Logstash 5 event API changes

It used to work with older versions of Logstash, but no longer. Logstash was originally written in Ruby, specifically JRuby for running on the JVM. For Logstash 5 the event implementation was rewritten in Java, and though JRuby extensions are still possible, the Ruby event API has changed so that the log data is no longer provided as the mutable hash the above code expects. (See also the Logstash event API documentation.)
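
To make the difference concrete, here is a minimal sketch contrasting the two styles, assuming it runs inside a ruby filter where event is in scope (the nginx field name is only an example):

# Old style (Logstash 2.x): the event data behaved like a mutable hash,
# which is why recipes like the one above simply did:
#   event.to_hash.delete_if { |field, value| value == '' }

# New style (Logstash 5+): mutating the hash returned by event.to_hash
# no longer changes the event; reads and writes go through the event API,
# addressing nested fields with Logstash field reference strings:
remote_port = event.get("[nginx][access][remote_port]")  # read a nested field
event.set("[nginx][access][remote_port]", remote_port)   # set a field
event.remove("[nginx][access][remote_port]")             # remove a field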

Custom Ruby code to prune in Logstash 5+

So I came up with Ruby code that works using the new Logstash event API. It is more complicated, but still pretty straightforward:

filter {
  # remove fields with empty values
  ruby {
    code => "
      def walk_hash(parent, path, hash)
        path << parent if parent
        hash.each do |key, value|
          walk_hash(key, path, value) if value.is_a?(Hash)
          @paths << (path + [key]).map {|p| '[' + p + ']' }.join('')
        end
        path.pop
      end

      @paths = []
      walk_hash(nil, [], event.to_hash)

      @paths.each do |path|
        value = event.get(path)
        event.remove(path) if value.nil? || (value.respond_to?(:empty?) && value.empty?)
      end
    "
  }
}

We first recursively walk the whole data structure that the API converts to a Ruby hash for us, collecting every nested field name in an array as the Logstash-style path the API expects, such as "[nginx][access][upstream_addr]". Then we iterate over those paths, use the API to check each one for an empty value, and remove it if so. This way we also avoid modifying the hash while we are still walking through it.
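
If you want to see what those generated paths look like outside of Logstash, here is a small standalone sketch of the same walk over a plain Ruby hash, with the paths passed around explicitly instead of via the @paths instance variable (the nested data is just a made-up nginx-style example):

# standalone version of the path-building walk, run on a plain Ruby hash
def walk_hash(parent, path, hash, paths)
  path << parent if parent
  hash.each do |key, value|
    walk_hash(key, path, value, paths) if value.is_a?(Hash)
    paths << (path + [key]).map { |p| "[#{p}]" }.join
  end
  path.pop
end

event_hash = {
  "message" => "GET /index.html",
  "nginx"   => { "access" => { "upstream_addr" => "", "remote_ip" => "10.0.0.1" } }
}

paths = []
walk_hash(nil, [], event_hash, paths)
p paths
# => ["[message]", "[nginx][access][upstream_addr]", "[nginx][access][remote_ip]",
#     "[nginx][access]", "[nginx]"]

Note that the list also contains the paths of the intermediate hashes themselves ("[nginx][access]", "[nginx]"), listed after their children, so in the filter above a sub-hash left completely empty after its children are removed ends up being pruned as well.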

With that configuration and code in a file in /etc/logstash/conf.d/ (this is on CentOS 7 using the logstash RPM from Elastic) all the fields with empty values are removed.

Some other log event trimming

In addition we added a few other filters to remove or limit the size of fields that we are happy to have on our own central log server for archival or forensic purposes, but that we don’t need to send to our paid log service for the kinds of reporting we are doing there:

mutate {
  remove_field => [
    "@version", "beat", "host", "input_type", "offset", "source", "type",
    "[geoip][location]",
    "[nginx][access][pipe]",
    "[nginx][access][remote_port]",
    "[nginx][access][ssl_session_id]",
    "[nginx][access][ssl_session_reused]",
    "[nginx][access][upstream_bytes_received]",
    "[nginx][access][upstream_connect_time]",
    "[nginx][access][upstream_response_length]",
    "[nginx][access][upstream_status]"
  ]
}

if "beats_input_codec_plain_applied" in [tags] {
  mutate {
    remove_tag => ["beats_input_codec_plain_applied"]
  }
}

truncate {
  length_bytes => 1024
}

For example, sometimes the client sends an absurdly long HTTP Referer request header, or the URI requested is very long — we see plenty longer than 5000 characters. We are happy to truncate those to save space.

We also do not need to waste space in our paid log service with the repetitive tag beats_input_codec_plain_applied or the same Filebeat version in every single log event.

Finis

This is working for us on Logstash 5.6.3, but should work on Logstash 5.0 and newer.
